Table of Contents
Introduction
Obesity is defined as having a Body Mass Index (BMI), calculated as weight divided by height squared, of at least 30. Being obese is considered a health risk, increasing the chances of diabetes, heart disease, stroke, cancer, mental health issues, and fertility issues. A related BMI category is overweight, which covers BMI values from 25 to 29.9 and carries similar health concerns at reduced rates. Unfortunately, in the United States, the combined share of adults who are obese or overweight has risen to roughly 70%, with obesity rates having tripled in the last 60 years.
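As a quick illustration of these definitions, below is a minimal sketch of how BMI and its weight categories could be computed; the bmi and classify_bmi helpers are our own illustrative names, using the standard metric formula of kilograms divided by meters squared.
#BMI is weight (kg) divided by height (m) squared
def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2
#thresholds follow the definitions above: 25 to 29.9 is overweight, 30 or more is obese
def classify_bmi(value):
    if value >= 30:
        return 'Obese'
    if value >= 25:
        return 'Overweight'
    return 'Neither'
#example: 95 kg at 1.75 m gives a BMI of about 31.0, which classifies as obese
print(classify_bmi(bmi(95, 1.75)))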
To understand why these rates of obesity and overweight have increased, it would be beneficial to identify risk factors that drive rates higher than normal. To do this, I will look at several factors that may contribute: demographic variables such as age, income, race, and education, as well as the price of food over time and the rate of physical activity among American adults.
Data Collection
We will be looking at data from two datasets. The first is the Consumer Price Index (CPI) of food groups over time. The CPI of food is a price index: it measures the average change over time in the prices consumers pay for food. This dataset comes from the U.S. Department of Agriculture and consists of annual percent changes in CPI for different food groups from 1974 to 2022.
The second dataset comes from the U.S. Department of Health and Human Services. It is more complex than the first, as it is the result of a survey taken across the country. The dataset holds survey results from 2011 to 2022, covering several different questions asked of several different demographic groups, as mentioned in the introduction. Most data values are the percentage of individuals who fall under the criteria for each question.
#pandas and numpy are standard for working with tabular data
import pandas as pd
import numpy as np
#these help visualize the data through graphs
import matplotlib.pyplot as plt
import seaborn as sns
#these provide statistical and machine-learning methods for more advanced analysis
import statsmodels.api
import statsmodels.formula.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
Step 1
Reading the data is simple, as both datasets are provided as CSV files, which are easily read into dataframes with pandas' read_csv.
cpihist = pd.read_csv('./historicalcpi.csv')
cpihist.head()
| | Consumer Price Index item | Year | Percent change |
|---|---|---|---|
| 0 | All food | 1974 | 14.3 |
| 1 | All food | 1975 | 8.5 |
| 2 | All food | 1976 | 3.0 |
| 3 | All food | 1977 | 6.3 |
| 4 | All food | 1978 | 9.9 |
nut_phys_etc = pd.read_csv('Nutrition__Physical_Activity__and_Obesity_-_Behavioral_Risk_Factor_Surveillance_System.csv')
nut_phys_etc.head()
| | YearStart | YearEnd | LocationAbbr | LocationDesc | Datasource | Class | Topic | Question | Data_Value_Unit | Data_Value_Type | ... | GeoLocation | ClassID | TopicID | QuestionID | DataValueTypeID | LocationID | StratificationCategory1 | Stratification1 | StratificationCategoryId1 | StratificationID1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | 2020 | US | National | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who engage in no leisure-tim... | NaN | Value | ... | NaN | PA | PA1 | Q047 | VALUE | 59 | Race/Ethnicity | Hispanic | RACE | RACEHIS |
| 1 | 2014 | 2014 | GU | Guam | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | (13.444304, 144.793731) | OWS | OWS1 | Q036 | VALUE | 66 | Education | High school graduate | EDU | EDUHSGRAD |
| 2 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | NaN | OWS | OWS1 | Q036 | VALUE | 59 | Income | $50,000 - $74,999 | INC | INC5075 |
| 3 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | NaN | OWS | OWS1 | Q037 | VALUE | 59 | Income | Data not reported | INC | INCNR |
| 4 | 2015 | 2015 | US | National | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who achieve at least 300 min... | NaN | Value | ... | NaN | PA | PA1 | Q045 | VALUE | 59 | Income | Less than $15,000 | INC | INCLESS15 |
5 rows × 33 columns
Data Cleaning
To actually work with the data from the two datasets, we first need to shape the dataframes into the form we want. The CPI of food dataset looks straightforward and will not need much work. The survey data is much more complex, and we will need to change it substantially to fit our needs.
Step 1
To get an idea of the specifics of the data, we will look at all possible values for each column in the table. This tells us which data is worth working with, and where we may need to change how some data is represented.
for c in nut_phys_etc.columns:
    print(f"Unique values in {c} : {nut_phys_etc[c].unique()}")
Unique values in YearStart : [2020 2014 2013 2015 2012 2011 2017 2016 2018 2019 2022 2021] Unique values in YearEnd : [2020 2014 2013 2015 2012 2011 2017 2016 2018 2019 2022 2021] Unique values in LocationAbbr : ['US' 'GU' 'WY' 'DC' 'PR' 'AL' 'RI' 'DE' 'NJ' 'WA' 'ME' 'MI' 'VA' 'CA' 'UT' 'NY' 'MA' 'AR' 'IL' 'NH' 'NM' 'MD' 'MN' 'HI' 'LA' 'SD' 'TX' 'KY' 'WV' 'CO' 'OK' 'MS' 'OR' 'WI' 'KS' 'FL' 'ID' 'AZ' 'VI' 'MT' 'GA' 'NC' 'PA' 'ND' 'SC' 'NE' 'TN' 'MO' 'NV' 'IA' 'IN' 'OH' 'AK' 'VT' 'CT'] Unique values in LocationDesc : ['National' 'Guam' 'Wyoming' 'District of Columbia' 'Puerto Rico' 'Alabama' 'Rhode Island' 'Delaware' 'New Jersey' 'Washington' 'Maine' 'Michigan' 'Virginia' 'California' 'Utah' 'New York' 'Massachusetts' 'Arkansas' 'Illinois' 'New Hampshire' 'New Mexico' 'Maryland' 'Minnesota' 'Hawaii' 'Louisiana' 'South Dakota' 'Texas' 'Kentucky' 'West Virginia' 'Colorado' 'Oklahoma' 'Mississippi' 'Oregon' 'Wisconsin' 'Kansas' 'Florida' 'Idaho' 'Arizona' 'Virgin Islands' 'Montana' 'Georgia' 'North Carolina' 'Pennsylvania' 'North Dakota' 'South Carolina' 'Nebraska' 'Tennessee' 'Missouri' 'Nevada' 'Iowa' 'Indiana' 'Ohio' 'Alaska' 'Vermont' 'Connecticut'] Unique values in Datasource : ['Behavioral Risk Factor Surveillance System' 'BRFSS'] Unique values in Class : ['Physical Activity' 'Obesity / Weight Status' 'Fruits and Vegetables'] Unique values in Topic : ['Physical Activity - Behavior' 'Obesity / Weight Status' 'Fruits and Vegetables - Behavior'] Unique values in Question : ['Percent of adults who engage in no leisure-time physical activity' 'Percent of adults aged 18 years and older who have obesity' 'Percent of adults aged 18 years and older who have an overweight classification' 'Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)' 'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week' 'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)' 'Percent of adults who engage in muscle-strengthening activities on 2 or more days a week' 'Percent of adults who report consuming fruit less than one time daily' 'Percent of adults who report consuming vegetables less than one time daily'] Unique values in Data_Value_Unit : [nan] Unique values in Data_Value_Type : ['Value'] Unique values in Data_Value : [30.6 29.3 28.8 32.7 26.6 27.4 48.5 31.6 38.1 35.2 30.5 40.2 35.4 32.3 15.3 nan 33.7 32.1 19.8 35.1 36.1 27.9 42. 16.9 34.5 31.5 50.4 32.2 29.5 21.7 34.2 21.5 29.8 40.1 18.8 13.7 33. 25.9 20.1 36. 29.1 17.5 38.5 35.3 38. 27.5 44.3 28.3 32.5 35.9 24.2 28.5 25. 56.1 25.1 39. 37.4 34. 26.1 33.2 46.1 23.6 22.6 26. 10.9 42.7 35.5 29.9 8.8 22.1 28.1 21.1 35.8 24. 42.8 41.9 33.3 33.1 23.8 38.4 36.6 29.2 27.7 23.1 13.1 10.6 24.6 20. 31.9 15. 23.2 65.8 39.5 30.4 27.2 26.5 29.7 32.8 22.7 10.1 38.9 19.6 30.2 36.9 20.2 26.8 52.8 41.8 15.7 15.9 34.9 27.3 23. 37. 34.1 26.9 36.4 32. 12.5 38.2 41.4 34.3 30.8 25.5 28.9 39.2 28.4 30.7 20.7 9.2 30.9 19.1 25.4 38.7 30. 18.4 7. 38.6 14.1 14.5 28.7 8.3 25.6 23.5 9.4 37.2 22.9 32.6 30.3 37.1 8.5 60.5 39.1 39.8 37.7 25.3 4.5 35.7 37.6 39.4 17. 
30.1 16.4 44.6 32.9 12.2 33.4 36.3 21.9 25.8 55.4 10.8 24.4 20.9 39.6 20.4 33.5 17.1 36.2 38.3 21.6 12.9 11.5 31.1 14.7 39.7 2.2 36.7 31.2 25.7 23.3 27.6 43.7 24.7 25.2 26.4 49.5 39.9 51.1 33.8 46.2 33.6 17.8 34.6 41.2 58.9 13.5 16.5 31. 48.1 9.7 52.7 40.9 48.4 29.6 33.9 22. 22.4 27. 31.4 34.8 27.1 14.6 4.4 10.4 15.6 26.7 42.9 44.8 40.4 16.7 40.8 28.2 24.8 43. 29.4 17.4 31.7 41.3 36.8 35. 18.7 14.8 16.3 32.4 41.5 41.1 22.8 44.5 43.9 22.3 20.3 29. 54.1 12.8 15.8 18.6 11.8 21.2 45.9 34.7 37.8 28.6 28. 36.5 12.4 15.4 31.8 40.7 42.1 16. 8.1 26.3 21.8 21.4 17.3 16.8 34.4 43.2 24.1 40. 46.3 11.6 40.6 37.9 31.3 41.7 43.1 19.5 19.3 26.2 9.6 54.2 53.6 45.7 14.2 27.8 43.4 42.3 35.6 17.2 14.9 24.9 37.3 21. 38.8 20.8 10.2 12.1 23.9 18.1 41. 16.2 44.4 37.5 54.5 44. 18.9 16.1 13.3 24.3 46.9 39.3 23.7 9.3 41.6 45.2 24.5 42.4 12.3 19.9 19.7 22.5 13.9 10.7 15.5 17.7 51. 40.3 18.2 43.6 15.1 18.3 47.9 42.2 50.9 5.9 16.6 9.8 7.3 45.5 44.1 19.2 49.6 48. 13.2 46. 6.9 51.3 19. 23.4 12.7 18.5 7.8 15.2 13. 18. 10.3 22.2 11.2 59.4 12.6 59.7 19.4 8.2 56.4 60.4 43.3 53.9 20.6 40.5 50.5 50.2 45.8 46.6 44.2 45.4 12. 54.4 20.5 47.5 7.7 43.5 48.9 53.2 42.6 62. 50.6 53.4 42.5 13.8 48.6 9.9 54.8 14.4 11.9 14.3 6.4 8.6 9.5 51.2 47.7 52.2 50. 21.3 2.6 45.6 13.6 11.7 13.4 47.6 55.2 58. 10.5 45.3 46.8 11.1 17.6 54. 47.3 52.5 49.4 57. 44.7 58.6 9.1 17.9 6.1 47.4 14. 48.2 47. 6.5 7.6 11.4 55.6 53.5 46.4 49. 50.3 11.3 8. 46.7 52.9 51.8 4.6 45.1 47.1 50.7 2.5 8.4 44.9 4.2 45. 8.7 50.1 47.2 60.8 51.9 51.5 3.6 43.8 53. 11. 50.8 6.7 47.8 49.8 49.1 10. 48.3 54.7 49.7 9. 6.2 7.5 46.5 5.7 52.6 49.2 48.7 52.4 55.7 49.9 54.9 56.9 56. 59.1 52.1 58.8 48.8 57.8 55. 59.5 65.2 58.7 51.6 55.5 54.3 62.6 7.4 63.6 51.4 56.5 53.7 63.4 59. 60.1 55.1 53.1 70.6 53.8 60.7 58.3 57.1 61.2 58.4 60. 59.6 59.3 56.8 55.9 71.6 52.3 49.3 52. 61.1 67.1 64.1 64.2 3.9 58.1 66.8 54.6 56.6 67.2 64.6 62.2 59.2 68.2 55.8 56.2 7.9 57.9 8.9 59.9 58.2 58.5 60.2 60.6 1.7 69.7 57.7 59.8 63.5 66.4 57.4 61.3 57.5 61.5 56.7 2.3 56.3 7.1 62.3 5.2 53.3 3.3 61.7 2.8 55.3 63. 6.6 61.9 51.7 65.1 60.9 62.4 63.1 63.3 4.1 71.4 61.4 73.1 62.5 5.8 62.8 61. 71.5 57.2 4. 2. 64.5 65.7 63.2 0.9 5.1 70. 64.7 62.1 63.7 5.6 61.6 65.6 61.8 64.3 57.6 66.9 65.3 66. 57.3 63.9 5.4 64.4 72.3 68.3 69.1 4.3 6.8 3.2 68.6 69.3 70.7 62.7 65. 68.9 5. 5.5 77.6 64.8 66.2 7.2 2.4 69. 6. 70.4 5.3 68. 70.9 66.3 60.3 3.8 66.1 67.8 71.2 64.9 64. 63.8 66.5 65.5 4.9 62.9 66.6 67.5 67.3 70.8 72.8 69.5 70.5 71.1 66.7 67.4 68.8 2.9 65.4 73.8 70.2 65.9 73.3 2.1 67.9 69.6 1.9 4.7 71.7 75.3 72.6 4.8 3. 6.3 67. 70.1 70.3 74.6 68.1 3.4] Unique values in Data_Value_Alt : [30.6 29.3 28.8 32.7 26.6 27.4 48.5 31.6 38.1 35.2 30.5 40.2 35.4 32.3 15.3 nan 33.7 32.1 19.8 35.1 36.1 27.9 42. 16.9 34.5 31.5 50.4 32.2 29.5 21.7 34.2 21.5 29.8 40.1 18.8 13.7 33. 25.9 20.1 36. 29.1 17.5 38.5 35.3 38. 27.5 44.3 28.3 32.5 35.9 24.2 28.5 25. 56.1 25.1 39. 37.4 34. 26.1 33.2 46.1 23.6 22.6 26. 10.9 42.7 35.5 29.9 8.8 22.1 28.1 21.1 35.8 24. 42.8 41.9 33.3 33.1 23.8 38.4 36.6 29.2 27.7 23.1 13.1 10.6 24.6 20. 31.9 15. 23.2 65.8 39.5 30.4 27.2 26.5 29.7 32.8 22.7 10.1 38.9 19.6 30.2 36.9 20.2 26.8 52.8 41.8 15.7 15.9 34.9 27.3 23. 37. 34.1 26.9 36.4 32. 12.5 38.2 41.4 34.3 30.8 25.5 28.9 39.2 28.4 30.7 20.7 9.2 30.9 19.1 25.4 38.7 30. 18.4 7. 38.6 14.1 14.5 28.7 8.3 25.6 23.5 9.4 37.2 22.9 32.6 30.3 37.1 8.5 60.5 39.1 39.8 37.7 25.3 4.5 35.7 37.6 39.4 17. 
30.1 16.4 44.6 32.9 12.2 33.4 36.3 21.9 25.8 55.4 10.8 24.4 20.9 39.6 20.4 33.5 17.1 36.2 38.3 21.6 12.9 11.5 31.1 14.7 39.7 2.2 36.7 31.2 25.7 23.3 27.6 43.7 24.7 25.2 26.4 49.5 39.9 51.1 33.8 46.2 33.6 17.8 34.6 41.2 58.9 13.5 16.5 31. 48.1 9.7 52.7 40.9 48.4 29.6 33.9 22. 22.4 27. 31.4 34.8 27.1 14.6 4.4 10.4 15.6 26.7 42.9 44.8 40.4 16.7 40.8 28.2 24.8 43. 29.4 17.4 31.7 41.3 36.8 35. 18.7 14.8 16.3 32.4 41.5 41.1 22.8 44.5 43.9 22.3 20.3 29. 54.1 12.8 15.8 18.6 11.8 21.2 45.9 34.7 37.8 28.6 28. 36.5 12.4 15.4 31.8 40.7 42.1 16. 8.1 26.3 21.8 21.4 17.3 16.8 34.4 43.2 24.1 40. 46.3 11.6 40.6 37.9 31.3 41.7 43.1 19.5 19.3 26.2 9.6 54.2 53.6 45.7 14.2 27.8 43.4 42.3 35.6 17.2 14.9 24.9 37.3 21. 38.8 20.8 10.2 12.1 23.9 18.1 41. 16.2 44.4 37.5 54.5 44. 18.9 16.1 13.3 24.3 46.9 39.3 23.7 9.3 41.6 45.2 24.5 42.4 12.3 19.9 19.7 22.5 13.9 10.7 15.5 17.7 51. 40.3 18.2 43.6 15.1 18.3 47.9 42.2 50.9 5.9 16.6 9.8 7.3 45.5 44.1 19.2 49.6 48. 13.2 46. 6.9 51.3 19. 23.4 12.7 18.5 7.8 15.2 13. 18. 10.3 22.2 11.2 59.4 12.6 59.7 19.4 8.2 56.4 60.4 43.3 53.9 20.6 40.5 50.5 50.2 45.8 46.6 44.2 45.4 12. 54.4 20.5 47.5 7.7 43.5 48.9 53.2 42.6 62. 50.6 53.4 42.5 13.8 48.6 9.9 54.8 14.4 11.9 14.3 6.4 8.6 9.5 51.2 47.7 52.2 50. 21.3 2.6 45.6 13.6 11.7 13.4 47.6 55.2 58. 10.5 45.3 46.8 11.1 17.6 54. 47.3 52.5 49.4 57. 44.7 58.6 9.1 17.9 6.1 47.4 14. 48.2 47. 6.5 7.6 11.4 55.6 53.5 46.4 49. 50.3 11.3 8. 46.7 52.9 51.8 4.6 45.1 47.1 50.7 2.5 8.4 44.9 4.2 45. 8.7 50.1 47.2 60.8 51.9 51.5 3.6 43.8 53. 11. 50.8 6.7 47.8 49.8 49.1 10. 48.3 54.7 49.7 9. 6.2 7.5 46.5 5.7 52.6 49.2 48.7 52.4 55.7 49.9 54.9 56.9 56. 59.1 52.1 58.8 48.8 57.8 55. 59.5 65.2 58.7 51.6 55.5 54.3 62.6 7.4 63.6 51.4 56.5 53.7 63.4 59. 60.1 55.1 53.1 70.6 53.8 60.7 58.3 57.1 61.2 58.4 60. 59.6 59.3 56.8 55.9 71.6 52.3 49.3 52. 61.1 67.1 64.1 64.2 3.9 58.1 66.8 54.6 56.6 67.2 64.6 62.2 59.2 68.2 55.8 56.2 7.9 57.9 8.9 59.9 58.2 58.5 60.2 60.6 1.7 69.7 57.7 59.8 63.5 66.4 57.4 61.3 57.5 61.5 56.7 2.3 56.3 7.1 62.3 5.2 53.3 3.3 61.7 2.8 55.3 63. 6.6 61.9 51.7 65.1 60.9 62.4 63.1 63.3 4.1 71.4 61.4 73.1 62.5 5.8 62.8 61. 71.5 57.2 4. 2. 64.5 65.7 63.2 0.9 5.1 70. 64.7 62.1 63.7 5.6 61.6 65.6 61.8 64.3 57.6 66.9 65.3 66. 57.3 63.9 5.4 64.4 72.3 68.3 69.1 4.3 6.8 3.2 68.6 69.3 70.7 62.7 65. 68.9 5. 5.5 77.6 64.8 66.2 7.2 2.4 69. 6. 70.4 5.3 68. 70.9 66.3 60.3 3.8 66.1 67.8 71.2 64.9 64. 63.8 66.5 65.5 4.9 62.9 66.6 67.5 67.3 70.8 72.8 69.5 70.5 71.1 66.7 67.4 68.8 2.9 65.4 73.8 70.2 65.9 73.3 2.1 67.9 69.6 1.9 4.7 71.7 75.3 72.6 4.8 3. 6.3 67. 70.1 70.3 74.6 68.1 3.4] Unique values in Data_Value_Footnote_Symbol : [nan '~'] Unique values in Data_Value_Footnote : [nan 'Data not available because sample size is insufficient.'] Unique values in Low_Confidence_Limit : [29.4 25.7 28.1 31.9 25.6 18.6 32.3 24. 32.6 30.7 25.8 33.3 33. 6.9 nan 25.1 21.6 10.5 34.6 31.5 27.6 38.4 16. 33.6 27.8 30.5 46.4 28. 19.9 33.5 19. 26.9 36.5 17.1 9.7 29.7 24.5 17.9 32.9 16.3 36. 32.2 34.4 25.4 24.4 40.1 26.6 35.3 22.8 22. 51.7 28.7 23.2 34.7 34.8 31.4 33.2 23. 30.8 37.2 12.8 19.5 19.1 9.9 27.2 38.9 24.7 6.6 21.4 27.5 16.8 15.2 20.5 34.5 25. 32.8 41.4 30.9 37.6 29.5 19.8 25.2 26.7 29.9 25.9 21.7 9. 18.7 32.4 30.2 18.2 11.9 63.9 30.1 34.2 29. 22.4 22.5 32. 21. 8.5 37.9 17.5 29.6 10.6 28.5 11.4 29.2 26.2 49. 28.4 37.5 21.3 39. 12.5 18.4 9.4 23.4 31.3 30.4 31.2 7.5 31.7 22.6 29.8 26.4 23.9 35.4 27.4 34. 26.8 18.5 20.2 27.9 16.1 23.5 14.8 6.2 7.8 11.2 25.5 19.2 30.3 8.2 29.3 20. 12.7 13.3 47.6 28.2 35.1 6.5 23.8 56.8 38.6 35.5 35. 
24.9 3.5 35.9 16.5 13.9 32.7 36.1 27.7 39.3 27.3 8.9 20.4 15.9 36.8 9.8 20.6 36.3 28.6 15.5 14.5 31.6 20.7 35.6 27.1 16.7 32.1 7. 18. 13.4 1.6 26.1 14.4 17.7 26. 18.8 21.8 24.3 48.8 41.8 10.1 38.5 22.2 58.4 28.9 38.3 23.6 21.1 23.3 10.3 22.9 33.7 28.8 37. 50.2 11.6 21.5 20.3 13.5 26.5 30.6 10.7 1.4 3.9 23.1 24.6 37.7 14.7 39.1 37.8 8.3 36.9 31.8 27. 22.1 25.3 10.8 23.7 26.3 10.9 15.3 20.8 17.6 36.4 6.7 15.1 24.1 15.4 19.6 38. 33.9 40.6 24.2 11.1 50.6 4.9 33.8 36.2 17.3 11.3 12.1 7.6 33.4 32.5 34.3 40. 19.3 18.9 20.1 17.8 8.8 38.8 11. 7.1 34.1 12.3 37.4 42.4 14. 35.2 24.8 21.2 29.1 43.6 18.3 41.3 17.4 16.2 13.7 6.1 18.1 51.9 50.3 36.6 11.5 28.3 12. 13.6 22.3 34.9 9.6 19.7 14.9 11.8 15.8 14.6 7.2 46.9 14.2 33.1 16.9 15.7 35.7 12.2 41. 40.5 31.1 30. 13.1 10. 22.7 20.9 12.6 16.6 19.4 45.2 5.6 40.7 39.9 40.3 8.6 9.3 9.5 42.7 36.7 12.9 17.2 46.7 4.6 14.3 39.6 13.2 6.8 8.1 2.6 8. 31. 39.5 43.3 8.4 42.3 16.4 49.2 15.6 38.1 12.4 13.8 10.2 35.8 9.1 58.9 53.7 39.8 13. 21.9 17. 6.3 39.4 50.5 55.9 51.6 44.2 42.1 41.6 15. 38.7 37.1 48.5 44.6 14.1 41.5 42.9 37.3 3.2 40.9 4.8 41.1 48. 11.7 52.6 6.4 40.4 44.3 52.3 5.8 39.2 3.3 7.3 3. 4.7 42.5 47.9 48.7 46. 42.8 0.8 41.2 9.2 44. 49.6 41.7 53.9 43.9 40.8 38.2 7.9 7.7 10.4 50.7 45.6 44.8 43.7 40.2 52.1 42.2 4.5 45.7 43. 4.3 3.6 48.6 45.5 43.5 44.1 5.7 8.7 4.2 45. 53.4 5.9 48.2 48.4 51.3 3.4 42.6 43.4 44.5 42. 0.7 53.1 4. 39.7 54.5 2.5 45.1 49.3 46.3 7.4 5.1 5. 2.4 43.8 5.2 2.9 5.4 45.3 41.9 52. 52.8 52.9 55. 48.9 47.1 47.2 59.1 56.2 46.1 58.3 54. 45.9 52.4 53.5 5.3 47.5 54.6 46.2 44.7 58.2 57.3 50. 49.4 49.5 44.9 1.7 51. 5.5 43.1 61.4 53. 48.3 60.2 54.2 57.2 53.6 68.1 51.2 47.4 44.4 55.2 46.6 47.7 53.3 49.8 49.9 2.2 55.6 57.6 53.2 55.7 50.4 52.2 51.1 69.5 49.1 46.8 45.8 47.8 51.4 52.5 43.2 55.3 47.3 49.7 53.8 47. 61.1 56.5 62.3 1. 62.6 48.1 51.5 64.7 62.1 57.4 66.4 52.7 57.8 54.7 56. 54.9 58.5 2.3 56.3 50.1 0.9 46.5 66.6 6. 56.6 60.4 56.4 63.3 59.5 55.4 62.7 59.9 3.7 45.4 58.1 54.1 61.8 61.2 57. 50.8 60.9 55.8 3.8 1.3 54.8 57.1 57.9 59. 59.7 1.2 0.6 60.8 4.4 0.3 50.9 66.5 55.5 62.4 57.5 59.6 59.3 51.8 54.3 65.2 64.1 60.6 58. 61.9 63. 62.2 56.1 62.8 54.4 55.1 56.9 66.7 58.8 60.7 67.2 63.1 64.3 2.7 3.1 64.6 56.7 2. 1.9 60.1 57.7 58.7 2.8 59.2 58.6 60.5 1.8 62.9 61.7 63.2 63.4 67.8 60. 61. 62.5 68. 59.8 60.3 64.2 61.5 65.6 64.8 59.4 61.6 63.8 70.2 66. 67.7 61.3 65. 1.5 4.1 64.9 67.3 64.5 65.1 65.4 0.4 63.6 66.8 67. 63.5 66.9 69. 62. 67.9 68.2 1.1] Unique values in High_Confidence_Limit : [31.8 33.3 29.5 33.5 27.6 38.5 64.9 40.4 43.8 40. 35.6 47.4 37.9 39.8 30.6 nan 43.4 44.8 34.4 35.7 41. 28.3 45.7 17.8 38.7 41.9 32.5 54.3 33. 31.1 23.7 32.6 35. 24.3 32.8 20.7 19.2 36.4 27.3 22.4 39.2 30.2 18.8 41.1 41.7 29.7 28.9 48.6 30.1 42.7 37.5 25.7 31.4 28.4 60.5 27.1 33.2 34.8 29.6 55.3 26.1 34.2 12. 52.3 46.6 11.6 22.8 28.6 35.5 41.5 27.8 51.5 28.2 38. 28.8 42.3 46.3 36.9 31.7 28.1 42.5 34.5 32.7 29.8 24.5 24.1 12.3 31.6 34. 31. 40.6 21.9 36.1 26.8 67.6 30.7 35.8 54.2 36.2 30.8 31.9 24.4 11.9 21.7 46.2 35.2 27.4 56.6 40.1 44.6 19.4 32.1 26.6 43. 38.1 34.9 29.2 38.4 20.2 49.8 37. 42.9 37.1 39.6 34.3 32. 27.2 35.3 52.2 43.2 32.2 35.9 31.2 23.2 11.3 23. 33.8 36.5 22.3 35.1 42.8 22.6 7.9 41.4 18.5 32.4 10.5 26.7 28.5 10.8 34.7 46. 44.4 16.5 37.4 45.8 19.3 49.3 39.1 11.1 26.2 64. 44.3 42.2 5.9 36. 43.1 17.5 30.5 19.1 40.3 43.6 50. 39. 16.6 44. 41.2 37.6 23.5 59.1 47.9 29. 21.2 26.3 33.9 38.3 20.1 48. 25.8 39.4 28. 44.9 23.1 22.5 18.6 30. 16.1 41.6 36.6 3.1 33.6 23.3 40.5 39.7 21.3 24.8 45.5 38.8 48.5 50.3 48.1 28.7 60.3 57. 
29.4 17.4 39.9 32.9 59.5 45.3 36.7 37.2 24.7 32.3 17.7 38.2 42.1 49.7 11.2 45.9 55.1 49.2 38.9 24.6 45.6 36.3 29.3 45. 42.6 47.5 40.8 33.4 58.2 19.6 12.7 46.8 25.2 54.6 22. 43.3 56.4 62.7 37.8 26.5 25.9 51.6 30.3 51.1 39.3 37.3 48.2 43.9 54.9 47.8 40.9 17.3 29.1 43.5 44.1 44.2 44.7 25.5 49.1 35.4 47.2 54.7 41.3 36.8 30.9 38.6 72.5 23.4 33.1 14.4 54.8 20. 12.9 43.7 14.9 31.5 23.8 51.9 37.7 48.3 14.3 34.1 68.4 9.1 39.5 29.9 46.9 56.7 50.4 53.1 40.7 46.5 45.4 51. 57.6 53.3 15.1 21.6 25.6 17.1 26.4 24. 14.8 48.4 56.8 55. 21. 42. 16.8 50.9 46.1 21.4 46.7 41.8 33.7 51.4 42.4 22.1 11.4 18.7 22.2 50.6 26. 20.3 62. 52.8 18.4 51.3 15.3 40.2 61.6 58.5 55.4 34.6 31.3 45.1 11. 27.5 65.9 50.7 12.6 47.1 45.2 13.2 18.1 21.5 20.8 17.6 12.8 54.4 19. 19.9 20.4 27.9 13.5 23.9 51.7 19.7 22.7 13.7 13.9 53.2 25.1 7.5 27.7 27. 54.5 20.9 61. 52.5 48.8 22.9 55.9 54.1 21.1 30.4 17.2 21.8 53.4 53. 14.2 20.5 8.7 48.7 49. 15.8 52.1 56. 11.8 15.6 59.9 9. 13. 65.4 19.8 18. 10.4 24.9 25. 26.9 15.9 10.3 9.7 14.7 62.2 64.6 18.2 58.3 59.2 56.2 24.2 47.3 50.2 16.9 51.2 56.1 20.6 16.4 15.2 13.6 50.1 60.7 25.3 19.5 49.9 16.3 52.4 12.4 23.6 60.2 56.3 18.9 46.4 14.5 18.3 59.4 25.4 47. 17. 49.4 65.8 58.4 59.7 67.7 62.8 70.6 56.9 13.1 58. 52.7 76. 49.6 13.3 44.5 47.6 49.5 56.5 63. 9.9 15.5 66.3 55.8 55.6 69.1 14.1 53.9 15.4 12.5 8.3 14.6 17.9 64.1 47.7 51.8 15. 69.8 61.4 57.2 53.5 54. 13.4 74.9 61.2 57.4 73.3 63.2 15.7 7.7 60.9 50.5 53.8 16. 55.7 11.5 14. 62.4 48.9 16.7 60.8 70.1 16.2 65. 55.2 58.9 57.3 10.6 6.2 52.6 59.6 57.7 53.6 8.2 50.8 8.4 57.1 66.6 53.7 57.8 12.2 61.9 10.1 66.7 62.9 60.1 5.1 68.7 59. 70.2 52. 10.7 58.6 61.8 60.6 13.8 55.5 59.3 64.8 10.2 66.4 60.4 67.9 7.8 9.3 61.3 58.8 10.9 65.7 63.9 52.9 68.9 66.5 67.3 68.6 74.4 62.1 63.1 75.2 60. 58.7 57.5 63.6 68.1 58.1 59.8 57.9 72.9 12.1 9.8 11.7 62.3 63.4 65.1 64.2 62.5 66.2 61.1 73.7 68. 80.9 61.5 65.2 63.7 67. 65.3 66.1 76.2 70.7 69.5 74.2 69.9 70.5 6.1 75. 65.5 9.2 72.3 64.7 72.2 61.7 66.8 3.4 72.6 63.8 10. 70.3 63.5 5.4 69.4 3.7 71.2 65.6 7.6 62.6 64.5 7.3 4.4 73.1 4.7 66.9 73.2 71.8 67.4 68.3 66. 7.4 63.3 72. 6.4 9.5 4.6 84.3 72.4 69.3 75.8 71.9 75.5 72.1 73.6 84.8 6.9 73.8 64.4 3. 71.5 76.1 9.4 67.1 67.2 67.8 69.2 75.6 8.8 64.3 8.6 74.5 83.2 74.8 71.3 68.8 69. 72.8 73.9 84. 67.5 76.6 83.4 78.6 70. 86.8 8.1 68.5 70.9 6.7 69.7 74.7 77.8 6.3 71.1 80.7 84.2 6.6 79.1 71. 77.4 71.7 8.5 76.3 83. 72.7 5.3 5.8 70.4 7. 79.8 79.2 74.6 77. 74.1 5.7 68.2 76.9 78.9 75.9 79.9 6.8 73.4 71.6 79.6 81.6 8.9 69.6 74.3 5.5 9.6 80.1 82.4 83.9 77.1 7.2 76.4 85. 80. 71.4 75.1 5. 80.6 81.2 70.8 79.4 78.7 77.5 76.8 86.6 75.4 87.7 81.1 78.5 73.5 84.6 78.3 82.9 74. 78.1 75.7 8. 75.3 77.9 85.2] Unique values in Sample_Size : [31255. 842. 62562. ... 9468. 4640. 3650.] 
Unique values in Total : [nan 'Total'] Unique values in Age(years) : [nan '25 - 34' '55 - 64' '18 - 24' '45 - 54' '35 - 44' '65 or older'] Unique values in Education : [nan 'High school graduate' 'Less than high school' 'Some college or technical school' 'College graduate'] Unique values in Gender : [nan 'Female' 'Male'] Unique values in Income : [nan '$50,000 - $74,999' 'Data not reported' 'Less than $15,000' '$25,000 - $34,999' '$15,000 - $24,999' '$35,000 - $49,999' '$75,000 or greater'] Unique values in Race/Ethnicity : ['Hispanic' nan 'American Indian/Alaska Native' 'Asian' 'Non-Hispanic White' 'Other' '2 or more races' 'Hawaiian/Pacific Islander' 'Non-Hispanic Black'] Unique values in GeoLocation : [nan '(13.444304, 144.793731)' '(43.235541343, -108.109830353)' '(38.890371385, -77.031961127)' '(18.220833, -66.590149)' '(32.840571122, -86.631860762)' '(41.708280193, -71.522470314)' '(39.008830667000495, -75.57774116799965)' '(40.130570048, -74.273691288)' '(47.522278629, -120.47001079)' '(45.254228894000505, -68.98503133599962)' '(44.661319543001, -84.71439027)' '(37.542680674, -78.457890463)' '(37.638640123, -120.999999538)' '(39.360700171, -111.587130635)' '(42.827001032, -75.543970427)' '(42.27687047, -72.082690675)' '(39.008830667, -75.577741168)' '(34.748650124, -92.274490743)' '(40.485010283, -88.997710178)' '(43.655950113, -71.50036092)' '(34.520880952, -106.240580985)' '(39.29058096400047, -76.60926011099963)' '(46.35564873600049, -94.79420050299967)' '(21.304850435, -157.857749403)' '(31.312660644, -92.445680071)' '(44.353130053000484, -100.3735306369997)' '(31.827240407, -99.426770206)' '(37.645970271, -84.774971048)' '(44.353130053, -100.373530637)' '(43.23554134300048, -108.10983035299967)' '(38.665510202, -80.712640135)' '(38.843840757, -106.133610921)' '(35.472031356, -97.521070214)' '(32.745510099, -89.538030825)' '(44.567449424, -120.155031326)' '(44.393191174, -89.816370742)' '(38.3477403, -98.200781227)' '(28.932040377, -81.928960539)' '(43.682630005, -114.363730042)' '(45.254228894001, -68.985031336)' '(34.86597028, -111.763811277)' '(18.335765, -64.896335)' '(32.84057112200048, -86.63186076199969)' '(47.066528972, -109.424420645)' '(38.89037138500049, -77.03196112699965)' '(46.355648736, -94.794200503)' '(32.839681093, -83.627580346)' '(35.466220975, -79.159250463)' '(40.793730152, -77.860700294)' '(47.475319779, -100.118421049)' '(33.998821303, -81.045371207)' '(41.641040988001, -99.365720623)' '(35.68094058, -85.774490914)' '(38.635790776, -92.566300053)' '(39.49324039, -117.071840564)' '(31.827240407000488, -99.42677020599967)' '(39.360700171000474, -111.58713063499971)' '(39.290580964, -76.609260111)' '(42.469400913, -93.816490556)' '(39.766910452, -86.149960194)' '(32.83968109300048, -83.62758034599966)' '(42.46940091300047, -93.81649055599968)' '(42.27687047000046, -72.08269067499964)' '(40.060210141, -82.404260056)' '(64.845079957001, -147.722059036)' '(43.625381239, -72.517640791)' '(41.56266102, -72.649840952)' '(40.13057004800049, -74.27369128799967)' '(44.6613195430005, -84.71439026999968)' '(43.62538123900049, -72.51764079099962)' '(34.865970280000454, -111.76381127699972)' '(40.06021014100048, -82.40426005599966)' '(40.79373015200048, -77.86070029399963)' '(39.493240390000494, -117.07184056399967)' '(33.998821303000454, -81.04537120699968)' '(37.54268067400045, -78.45789046299967)' '(34.52088095200048, -106.24058098499967)' '(47.06652897200047, -109.42442064499971)' '(44.39319117400049, -89.81637074199966)' '(37.63864012300047, -120.99999953799971)' 
'(31.31266064400046, -92.44568007099969)' '(44.56744942400047, -120.15503132599969)' '(35.47203135600046, -97.52107021399968)' '(47.47531977900047, -100.11842104899966)' '(35.68094058000048, -85.77449091399967)' '(47.52227862900048, -120.47001078999972)' '(38.66551020200046, -80.71264013499967)' '(64.84507995700051, -147.72205903599973)' '(38.635790776000476, -92.56630005299968)' '(35.466220975000454, -79.15925046299964)' '(41.6410409880005, -99.36572062299967)' '(38.34774030000045, -98.20078122699965)' '(37.645970271000465, -84.77497104799966)' '(41.56266102000046, -72.64984095199964)' '(39.766910452000445, -86.14996019399968)' '(32.745510099000455, -89.53803082499968)' '(41.70828019300046, -71.52247031399963)' '(28.932040377000476, -81.92896053899966)' '(38.843840757000464, -106.13361092099967)' '(43.682630005000476, -114.3637300419997)' '(40.48501028300046, -88.99771017799969)' '(42.82700103200045, -75.54397042699964)' '(21.304850435000446, -157.85774940299973)' '(34.74865012400045, -92.27449074299966)' '(43.65595011300047, -71.50036091999965)'] Unique values in ClassID : ['PA' 'OWS' 'FV'] Unique values in TopicID : ['PA1' 'OWS1' 'FV1'] Unique values in QuestionID : ['Q047' 'Q036' 'Q037' 'Q045' 'Q044' 'Q043' 'Q046' 'Q018' 'Q019'] Unique values in DataValueTypeID : ['VALUE'] Unique values in LocationID : [59 66 56 11 72 1 44 10 34 53 23 26 51 6 49 36 25 5 17 33 35 24 27 15 22 46 48 21 54 8 40 28 41 55 20 12 16 4 78 30 13 37 42 38 45 31 47 29 32 19 18 39 2 50 9] Unique values in StratificationCategory1 : ['Race/Ethnicity' 'Education' 'Income' 'Age (years)' 'Gender' 'Total' nan] Unique values in Stratification1 : ['Hispanic' 'High school graduate' '$50,000 - $74,999' 'Data not reported' 'Less than $15,000' 'American Indian/Alaska Native' 'Less than high school' '$25,000 - $34,999' '25 - 34' 'Asian' 'Non-Hispanic White' 'Other' '55 - 64' 'Some college or technical school' 'Female' '18 - 24' '45 - 54' '$15,000 - $24,999' 'Total' 'Male' '35 - 44' '2 or more races' 'Hawaiian/Pacific Islander' 'College graduate' 'Non-Hispanic Black' '$35,000 - $49,999' '$75,000 or greater' '65 or older' nan] Unique values in StratificationCategoryId1 : ['RACE' 'EDU' 'INC' 'AGEYR' 'GEN' 'OVR' nan] Unique values in StratificationID1 : ['RACEHIS' 'EDUHSGRAD' 'INC5075' 'INCNR' 'INCLESS15' 'RACENAA' 'EDUHS' 'INC2535' 'AGEYR2534' 'RACEASN' 'RACEWHT' 'RACEOTH' 'AGEYR5564' 'EDUCOTEC' 'FEMALE' 'AGEYR1824' 'AGEYR4554' 'INC1525' 'OVERALL' 'MALE' 'AGEYR3544' 'RACE2PLUS' 'RACEHPI' 'EDUCOGRAD' 'RACEBLK' 'INC3550' 'INC75PLUS' 'AGEYR65PLUS' nan]
Step 2
There is a lot of data present in this table, but not all of it is necessary. To begin, we will cut down the table's size by only including survey responses to questions relevant to our goal of identifying health risks: the questions measuring obesity rates, overweight rates, and physical inactivity among adults.
#the questions we care about
nut_list = ['Percent of adults who engage in no leisure-time physical activity',
            'Percent of adults aged 18 years and older who have obesity',
            'Percent of adults aged 18 years and older who have an overweight classification']
#we create a new, filtered dataframe from the original bloated set
cleaned_ntr = nut_phys_etc[nut_phys_etc['Question'].isin(nut_list)]
print(cleaned_ntr)
YearStart YearEnd LocationAbbr LocationDesc \
0 2020 2020 US National
1 2014 2014 GU Guam
2 2013 2013 US National
3 2013 2013 US National
6 2012 2012 WY Wyoming
... ... ... ... ...
93244 2022 2022 WY Wyoming
93245 2022 2022 WY Wyoming
93246 2022 2022 WY Wyoming
93247 2022 2022 WY Wyoming
93248 2022 2022 WY Wyoming
Datasource Class \
0 Behavioral Risk Factor Surveillance System Physical Activity
1 Behavioral Risk Factor Surveillance System Obesity / Weight Status
2 Behavioral Risk Factor Surveillance System Obesity / Weight Status
3 Behavioral Risk Factor Surveillance System Obesity / Weight Status
6 Behavioral Risk Factor Surveillance System Obesity / Weight Status
... ... ...
93244 BRFSS Obesity / Weight Status
93245 BRFSS Physical Activity
93246 BRFSS Obesity / Weight Status
93247 BRFSS Obesity / Weight Status
93248 BRFSS Obesity / Weight Status
Topic \
0 Physical Activity - Behavior
1 Obesity / Weight Status
2 Obesity / Weight Status
3 Obesity / Weight Status
6 Obesity / Weight Status
... ...
93244 Obesity / Weight Status
93245 Physical Activity - Behavior
93246 Obesity / Weight Status
93247 Obesity / Weight Status
93248 Obesity / Weight Status
Question Data_Value_Unit \
0 Percent of adults who engage in no leisure-tim... NaN
1 Percent of adults aged 18 years and older who ... NaN
2 Percent of adults aged 18 years and older who ... NaN
3 Percent of adults aged 18 years and older who ... NaN
6 Percent of adults aged 18 years and older who ... NaN
... ... ...
93244 Percent of adults aged 18 years and older who ... NaN
93245 Percent of adults who engage in no leisure-tim... NaN
93246 Percent of adults aged 18 years and older who ... NaN
93247 Percent of adults aged 18 years and older who ... NaN
93248 Percent of adults aged 18 years and older who ... NaN
Data_Value_Type ... GeoLocation ClassID \
0 Value ... NaN PA
1 Value ... (13.444304, 144.793731) OWS
2 Value ... NaN OWS
3 Value ... NaN OWS
6 Value ... (43.235541343, -108.109830353) OWS
... ... ... ... ...
93244 Value ... (43.23554134300048, -108.10983035299967) OWS
93245 Value ... (43.23554134300048, -108.10983035299967) PA
93246 Value ... (43.23554134300048, -108.10983035299967) OWS
93247 Value ... (43.23554134300048, -108.10983035299967) OWS
93248 Value ... (43.23554134300048, -108.10983035299967) OWS
TopicID QuestionID DataValueTypeID LocationID \
0 PA1 Q047 VALUE 59
1 OWS1 Q036 VALUE 66
2 OWS1 Q036 VALUE 59
3 OWS1 Q037 VALUE 59
6 OWS1 Q037 VALUE 56
... ... ... ... ...
93244 OWS1 Q037 VALUE 56
93245 PA1 Q047 VALUE 56
93246 OWS1 Q036 VALUE 56
93247 OWS1 Q037 VALUE 56
93248 OWS1 Q036 VALUE 56
StratificationCategory1 Stratification1 \
0 Race/Ethnicity Hispanic
1 Education High school graduate
2 Income $50,000 - $74,999
3 Income Data not reported
6 Race/Ethnicity American Indian/Alaska Native
... ... ...
93244 Income Less than $15,000
93245 Education Less than high school
93246 Age (years) 35 - 44
93247 Income $35,000 - $49,999
93248 Education Less than high school
StratificationCategoryId1 StratificationID1
0 RACE RACEHIS
1 EDU EDUHSGRAD
2 INC INC5075
3 INC INCNR
6 RACE RACENAA
... ... ...
93244 INC INCLESS15
93245 EDU EDUHS
93246 AGEYR AGEYR3544
93247 INC INC3550
93248 EDU EDUHS
[54323 rows x 33 columns]
Step 3
After taking a closer look at the table, we see that some columns have missing values, represented as NaN. This will be an issue, as our analysis cannot work with missing data. We need to find what causes this data to be missing, and decide whether to filter it out or replace it.
missing_values = cleaned_ntr.isna().sum()
print(missing_values)
YearStart                         0
YearEnd                           0
LocationAbbr                      0
LocationDesc                      0
Datasource                        0
Class                             0
Topic                             0
Question                          0
Data_Value_Unit               54323
Data_Value_Type                   0
Data_Value                     5264
Data_Value_Alt                 5264
Data_Value_Footnote_Symbol    49059
Data_Value_Footnote           49059
Low_Confidence_Limit           5264
High_Confidence_Limit          5264
Sample_Size                    5264
Total                         52383
Age(years)                    42683
Education                     46563
Gender                        50443
Income                        40743
Race/Ethnicity                38803
GeoLocation                    1008
ClassID                           0
TopicID                           0
QuestionID                        0
DataValueTypeID                   0
LocationID                        0
StratificationCategory1           3
Stratification1                   3
StratificationCategoryId1         3
StratificationID1                 3
dtype: int64
Step 4
We see that our most important column, Data_Value, has missing values. These are the actual percent values we need for analysis; if they are missing, we cannot do anything with those entries. Our first step will be to remove these insufficient entries, and then continue pruning the remaining missing data.
remove_insuff_ntr = cleaned_ntr[cleaned_ntr['Data_Value'].notna()]
missing_values = remove_insuff_ntr.isna().sum()
print(missing_values)
print(remove_insuff_ntr)
YearStart 0
YearEnd 0
LocationAbbr 0
LocationDesc 0
Datasource 0
Class 0
Topic 0
Question 0
Data_Value_Unit 49059
Data_Value_Type 0
Data_Value 0
Data_Value_Alt 0
Data_Value_Footnote_Symbol 49059
Data_Value_Footnote 49059
Low_Confidence_Limit 0
High_Confidence_Limit 0
Sample_Size 0
Total 47128
Age(years) 37473
Education 41335
Gender 45197
Income 35545
Race/Ethnicity 38617
GeoLocation 1005
ClassID 0
TopicID 0
QuestionID 0
DataValueTypeID 0
LocationID 0
StratificationCategory1 0
Stratification1 0
StratificationCategoryId1 0
StratificationID1 0
dtype: int64
YearStart YearEnd LocationAbbr LocationDesc \
0 2020 2020 US National
1 2014 2014 GU Guam
2 2013 2013 US National
3 2013 2013 US National
6 2012 2012 WY Wyoming
... ... ... ... ...
93244 2022 2022 WY Wyoming
93245 2022 2022 WY Wyoming
93246 2022 2022 WY Wyoming
93247 2022 2022 WY Wyoming
93248 2022 2022 WY Wyoming
Datasource Class \
0 Behavioral Risk Factor Surveillance System Physical Activity
1 Behavioral Risk Factor Surveillance System Obesity / Weight Status
2 Behavioral Risk Factor Surveillance System Obesity / Weight Status
3 Behavioral Risk Factor Surveillance System Obesity / Weight Status
6 Behavioral Risk Factor Surveillance System Obesity / Weight Status
... ... ...
93244 BRFSS Obesity / Weight Status
93245 BRFSS Physical Activity
93246 BRFSS Obesity / Weight Status
93247 BRFSS Obesity / Weight Status
93248 BRFSS Obesity / Weight Status
Topic \
0 Physical Activity - Behavior
1 Obesity / Weight Status
2 Obesity / Weight Status
3 Obesity / Weight Status
6 Obesity / Weight Status
... ...
93244 Obesity / Weight Status
93245 Physical Activity - Behavior
93246 Obesity / Weight Status
93247 Obesity / Weight Status
93248 Obesity / Weight Status
Question Data_Value_Unit \
0 Percent of adults who engage in no leisure-tim... NaN
1 Percent of adults aged 18 years and older who ... NaN
2 Percent of adults aged 18 years and older who ... NaN
3 Percent of adults aged 18 years and older who ... NaN
6 Percent of adults aged 18 years and older who ... NaN
... ... ...
93244 Percent of adults aged 18 years and older who ... NaN
93245 Percent of adults who engage in no leisure-tim... NaN
93246 Percent of adults aged 18 years and older who ... NaN
93247 Percent of adults aged 18 years and older who ... NaN
93248 Percent of adults aged 18 years and older who ... NaN
Data_Value_Type ... GeoLocation ClassID \
0 Value ... NaN PA
1 Value ... (13.444304, 144.793731) OWS
2 Value ... NaN OWS
3 Value ... NaN OWS
6 Value ... (43.235541343, -108.109830353) OWS
... ... ... ... ...
93244 Value ... (43.23554134300048, -108.10983035299967) OWS
93245 Value ... (43.23554134300048, -108.10983035299967) PA
93246 Value ... (43.23554134300048, -108.10983035299967) OWS
93247 Value ... (43.23554134300048, -108.10983035299967) OWS
93248 Value ... (43.23554134300048, -108.10983035299967) OWS
TopicID QuestionID DataValueTypeID LocationID \
0 PA1 Q047 VALUE 59
1 OWS1 Q036 VALUE 66
2 OWS1 Q036 VALUE 59
3 OWS1 Q037 VALUE 59
6 OWS1 Q037 VALUE 56
... ... ... ... ...
93244 OWS1 Q037 VALUE 56
93245 PA1 Q047 VALUE 56
93246 OWS1 Q036 VALUE 56
93247 OWS1 Q037 VALUE 56
93248 OWS1 Q036 VALUE 56
StratificationCategory1 Stratification1 \
0 Race/Ethnicity Hispanic
1 Education High school graduate
2 Income $50,000 - $74,999
3 Income Data not reported
6 Race/Ethnicity American Indian/Alaska Native
... ... ...
93244 Income Less than $15,000
93245 Education Less than high school
93246 Age (years) 35 - 44
93247 Income $35,000 - $49,999
93248 Education Less than high school
StratificationCategoryId1 StratificationID1
0 RACE RACEHIS
1 EDU EDUHSGRAD
2 INC INC5075
3 INC INCNR
6 RACE RACENAA
... ... ...
93244 INC INCLESS15
93245 EDU EDUHS
93246 AGEYR AGEYR3544
93247 INC INC3550
93248 EDU EDUHS
[49059 rows x 33 columns]
Step 5
We see that three columns, Data_Value_Unit, Data_Value_Footnote_Symbol, and Data_Value_Footnote, have as many missing values as there are rows in the table. Every entry in these three columns is NaN, so they carry no information. Our next step is to drop them, then review the remaining missing values and the new-look dataframe.
drop_empty_ntr = remove_insuff_ntr.drop(['Data_Value_Unit', 'Data_Value_Footnote_Symbol', 'Data_Value_Footnote'], axis=1)
missing_values = drop_empty_ntr.isna().sum()
print(missing_values)
drop_empty_ntr.head()
YearStart                        0
YearEnd                          0
LocationAbbr                     0
LocationDesc                     0
Datasource                       0
Class                            0
Topic                            0
Question                         0
Data_Value_Type                  0
Data_Value                       0
Data_Value_Alt                   0
Low_Confidence_Limit             0
High_Confidence_Limit            0
Sample_Size                      0
Total                        47128
Age(years)                   37473
Education                    41335
Gender                       45197
Income                       35545
Race/Ethnicity               38617
GeoLocation                   1005
ClassID                          0
TopicID                          0
QuestionID                       0
DataValueTypeID                  0
LocationID                       0
StratificationCategory1          0
Stratification1                  0
StratificationCategoryId1        0
StratificationID1                0
dtype: int64
| | YearStart | YearEnd | LocationAbbr | LocationDesc | Datasource | Class | Topic | Question | Data_Value_Type | Data_Value | ... | GeoLocation | ClassID | TopicID | QuestionID | DataValueTypeID | LocationID | StratificationCategory1 | Stratification1 | StratificationCategoryId1 | StratificationID1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | 2020 | US | National | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who engage in no leisure-tim... | Value | 30.6 | ... | NaN | PA | PA1 | Q047 | VALUE | 59 | Race/Ethnicity | Hispanic | RACE | RACEHIS |
| 1 | 2014 | 2014 | GU | Guam | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | Value | 29.3 | ... | (13.444304, 144.793731) | OWS | OWS1 | Q036 | VALUE | 66 | Education | High school graduate | EDU | EDUHSGRAD |
| 2 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | Value | 28.8 | ... | NaN | OWS | OWS1 | Q036 | VALUE | 59 | Income | $50,000 - $74,999 | INC | INC5075 |
| 3 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | Value | 32.7 | ... | NaN | OWS | OWS1 | Q037 | VALUE | 59 | Income | Data not reported | INC | INCNR |
| 6 | 2012 | 2012 | WY | Wyoming | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | Value | 48.5 | ... | (43.235541343, -108.109830353) | OWS | OWS1 | Q037 | VALUE | 56 | Race/Ethnicity | American Indian/Alaska Native | RACE | RACENAA |
5 rows × 30 columns
Step 6
Looking back at our long list of unique values, we notice other unnecessary columns. We do not need more than one column each for Location, Class, Topic, and Question, so we will drop their ID columns. We also do not need a Datasource column, since there is only one source. Class will not be necessary for our analysis either, as the questions are enough to separate it. GeoLocation is not relevant because we will not be looking at coordinate data, so we drop that as well.
We will also check whether we need both YearStart and YearEnd. If there are 0 instances of the survey spanning more than one calendar year, we will only need one year column for our data.
drop_red_ntr = drop_empty_ntr.drop(['LocationAbbr', 'Datasource', 'Class', 'Data_Value_Type', 'ClassID', 'TopicID', 'QuestionID', 'DataValueTypeID', 'LocationID', 'GeoLocation'], axis=1)
check_years = (drop_red_ntr['YearStart'] != drop_red_ntr['YearEnd']).sum()
print(check_years)
0
drop_red_ntr = drop_red_ntr.drop('YearEnd', axis=1)
drop_red_ntr = drop_red_ntr.rename(columns={'YearStart': 'Year'})
drop_red_ntr.head()
| | Year | LocationDesc | Topic | Question | Data_Value | Data_Value_Alt | Low_Confidence_Limit | High_Confidence_Limit | Sample_Size | Total | Age(years) | Education | Gender | Income | Race/Ethnicity | StratificationCategory1 | Stratification1 | StratificationCategoryId1 | StratificationID1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | National | Physical Activity - Behavior | Percent of adults who engage in no leisure-tim... | 30.6 | 30.6 | 29.4 | 31.8 | 31255.0 | NaN | NaN | NaN | NaN | NaN | Hispanic | Race/Ethnicity | Hispanic | RACE | RACEHIS |
| 1 | 2014 | Guam | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 29.3 | 29.3 | 25.7 | 33.3 | 842.0 | NaN | NaN | High school graduate | NaN | NaN | NaN | Education | High school graduate | EDU | EDUHSGRAD |
| 2 | 2013 | National | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 28.8 | 28.8 | 28.1 | 29.5 | 62562.0 | NaN | NaN | NaN | NaN | $50,000 - $74,999 | NaN | Income | $50,000 - $74,999 | INC | INC5075 |
| 3 | 2013 | National | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 32.7 | 32.7 | 31.9 | 33.5 | 60069.0 | NaN | NaN | NaN | NaN | Data not reported | NaN | Income | Data not reported | INC | INCNR |
| 6 | 2012 | Wyoming | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 48.5 | 48.5 | 32.3 | 64.9 | 69.0 | NaN | NaN | NaN | NaN | NaN | American Indian/Alaska Native | Race/Ethnicity | American Indian/Alaska Native | RACE | RACENAA |
Step 7
We have now pruned most of the unnecessary data from our dataframe, so we will look for any other forms of missing data. Looking at the stratification categories, the areas where the survey was separated by demographic group, we see one value that represents incomplete data: some respondents did not report their income, denoted by a StratificationID1 of INCNR, so we will drop those rows as well.
We will then do one final check and see that no remaining missing data can cause problems. The NaN values in the stratification category columns are acceptable because each question was asked independently per demographic group, so a survey response stratified by education would not carry values for age, gender, income, or race; we sanity-check this below, after the preview of the dataframe.
print(drop_red_ntr['StratificationCategory1'].unique())
print(drop_red_ntr['StratificationID1'].unique())
print(drop_red_ntr['Stratification1'].unique())
['Race/Ethnicity' 'Education' 'Income' 'Age (years)' 'Gender' 'Total']
['RACEHIS' 'EDUHSGRAD' 'INC5075' 'RACENAA' 'EDUHS' 'INC2535' 'AGEYR2534' 'INCLESS15' 'RACEASN' 'AGEYR5564' 'RACEOTH' 'EDUCOTEC' 'FEMALE' 'AGEYR1824' 'AGEYR4554' 'INC1525' 'OVERALL' 'MALE' 'AGEYR3544' 'RACE2PLUS' 'EDUCOGRAD' 'RACEWHT' 'INC3550' 'INC75PLUS' 'AGEYR65PLUS' 'RACEBLK' 'RACEHPI']
['Hispanic' 'High school graduate' '$50,000 - $74,999' 'American Indian/Alaska Native' 'Less than high school' '$25,000 - $34,999' '25 - 34' 'Less than $15,000' 'Asian' '55 - 64' 'Other' 'Some college or technical school' 'Female' '18 - 24' '45 - 54' '$15,000 - $24,999' 'Total' 'Male' '35 - 44' '2 or more races' 'College graduate' 'Non-Hispanic White' '$35,000 - $49,999' '$75,000 or greater' '65 or older' 'Non-Hispanic Black' 'Hawaiian/Pacific Islander']
drop_red_ntr = drop_red_ntr[~drop_red_ntr['StratificationID1'].isin(['INCNR'])]
print(drop_red_ntr.isna().sum())
drop_red_ntr.head()
Year                             0
LocationDesc                     0
Topic                            0
Question                         0
Data_Value                       0
Data_Value_Alt                   0
Low_Confidence_Limit             0
High_Confidence_Limit            0
Sample_Size                      0
Total                        45197
Age(years)                   35542
Education                    39404
Gender                       43266
Income                       35545
Race/Ethnicity               36686
StratificationCategory1          0
Stratification1                  0
StratificationCategoryId1        0
StratificationID1                0
dtype: int64
| | Year | LocationDesc | Topic | Question | Data_Value | Data_Value_Alt | Low_Confidence_Limit | High_Confidence_Limit | Sample_Size | Total | Age(years) | Education | Gender | Income | Race/Ethnicity | StratificationCategory1 | Stratification1 | StratificationCategoryId1 | StratificationID1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | National | Physical Activity - Behavior | Percent of adults who engage in no leisure-tim... | 30.6 | 30.6 | 29.4 | 31.8 | 31255.0 | NaN | NaN | NaN | NaN | NaN | Hispanic | Race/Ethnicity | Hispanic | RACE | RACEHIS |
| 1 | 2014 | Guam | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 29.3 | 29.3 | 25.7 | 33.3 | 842.0 | NaN | NaN | High school graduate | NaN | NaN | NaN | Education | High school graduate | EDU | EDUHSGRAD |
| 2 | 2013 | National | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 28.8 | 28.8 | 28.1 | 29.5 | 62562.0 | NaN | NaN | NaN | NaN | $50,000 - $74,999 | NaN | Income | $50,000 - $74,999 | INC | INC5075 |
| 6 | 2012 | Wyoming | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 48.5 | 48.5 | 32.3 | 64.9 | 69.0 | NaN | NaN | NaN | NaN | NaN | American Indian/Alaska Native | Race/Ethnicity | American Indian/Alaska Native | RACE | RACENAA |
| 7 | 2012 | District of Columbia | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 31.6 | 31.6 | 24.0 | 40.4 | 243.0 | NaN | NaN | Less than high school | NaN | NaN | NaN | Education | Less than high school | EDU | EDUHS |
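As a quick sanity check of the claim that each row carries exactly one demographic breakdown, we can count the non-null entries across the six stratification columns for every row. This is a minimal sketch (strat_cols and non_null_counts are our own illustrative names); if it prints True, the remaining NaN values are structural rather than problematic.
#the six demographic breakdown columns left in the cleaned dataframe
strat_cols = ['Total', 'Age(years)', 'Education', 'Gender', 'Income', 'Race/Ethnicity']
#count how many of these columns are filled in for each row; every row should have exactly one
non_null_counts = drop_red_ntr[strat_cols].notna().sum(axis=1)
print((non_null_counts == 1).all())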
Step 8
One last thing we want to do is make our dataframe easier to manage. Our CPI data is counted at the national level, and state-level data may vary greatly from that. To address this, we will assign each state to a region, and drop U.S. territories from the data. The region assignments are defined in the code below.
Our next step will be to aggregate our data: for each year and question, we want the average percent value for every combination of stratification value and region, which we compute with the groupby function. We will also confirm that there are no missing values left in this aggregated dataframe.
#our regions
west = ['Washington', 'Oregon', 'California', 'Idaho', 'Nevada', 'Utah', 'Montana', 'Wyoming', 'Colorado', 'Alaska', 'Hawaii']
midwest = ['North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri', 'Michigan', 'Illinois', 'Indiana', 'Ohio']
northeast = ['Pennsylvania', 'Maryland', 'Delaware', 'New Jersey', 'New York', 'New Hampshire', 'Vermont', 'Connecticut', 'Rhode Island', 'Maine', 'Massachusetts']
southeast = ['District of Columbia', 'Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'West Virginia', 'Kentucky', 'Tennessee', 'Arkansas', 'Louisiana', 'Mississippi', 'Alabama']
southwest = ['Oklahoma', 'Texas', 'New Mexico', 'Arizona']
regions = {'National': 'National'}
for state in west:
    regions[state] = 'West'
for state in midwest:
    regions[state] = 'Midwest'
for state in northeast:
    regions[state] = 'Northeast'
for state in southeast:
    regions[state] = 'Southeast'
for state in southwest:
    regions[state] = 'Southwest'
#we create a new column that holds the region for each entry
drop_red_ntr['Region'] = drop_red_ntr['LocationDesc'].map(regions)
#we drop U.S. territories (minus D.C.)
reg_ntr = drop_red_ntr.dropna(subset=['Region'])
#we group by the four variables and take the mean percent response, then print the resulting dataframe and its missing values
#we use reset_index so the grouped-by columns become regular columns rather than multiindex values
reg_avgs = reg_ntr.groupby(['Year', 'Question', 'Stratification1', 'Region']).agg({
    'Data_Value': 'mean'
}).reset_index()
print(reg_avgs['Data_Value'].isna().sum())
print(reg_avgs)
0
Year Question \
0 2011 Percent of adults aged 18 years and older who ...
1 2011 Percent of adults aged 18 years and older who ...
2 2011 Percent of adults aged 18 years and older who ...
3 2011 Percent of adults aged 18 years and older who ...
4 2011 Percent of adults aged 18 years and older who ...
... ... ...
5669 2022 Percent of adults who engage in no leisure-tim...
5670 2022 Percent of adults who engage in no leisure-tim...
5671 2022 Percent of adults who engage in no leisure-tim...
5672 2022 Percent of adults who engage in no leisure-tim...
5673 2022 Percent of adults who engage in no leisure-tim...
Stratification1 Region Data_Value
0 $15,000 - $24,999 Midwest 34.018182
1 $15,000 - $24,999 National 34.500000
2 $15,000 - $24,999 Northeast 33.763636
3 $15,000 - $24,999 Southeast 32.553846
4 $15,000 - $24,999 Southwest 34.375000
... ... ... ...
5669 Total National 23.700000
5670 Total Northeast 22.463636
5671 Total Southeast 26.007692
5672 Total Southwest 25.650000
5673 Total West 20.427273
[5674 rows x 5 columns]
Step 9
We will now get the CPI dataframe into the form we want. Our first step is to find and handle missing values. We see that there are 25 of them, which is few enough to inspect individually. Luckily for us, these missing values fall outside our scope: our survey time range is 2011 to 2022, so missing values from 1974 to 1998 are irrelevant to our analysis.
print(cpihist.isna().sum())
Consumer Price Index item     0
Year                          0
Percent change               25
dtype: int64
missing_cpi = cpihist[cpihist['Percent change'].isna()]
missing_cpi.head(25)
| | Consumer Price Index item | Year | Percent change |
|---|---|---|---|
| 833 | Processed fruits and vegetables | 1974 | NaN |
| 834 | Processed fruits and vegetables | 1975 | NaN |
| 835 | Processed fruits and vegetables | 1976 | NaN |
| 836 | Processed fruits and vegetables | 1977 | NaN |
| 837 | Processed fruits and vegetables | 1978 | NaN |
| 838 | Processed fruits and vegetables | 1979 | NaN |
| 839 | Processed fruits and vegetables | 1980 | NaN |
| 840 | Processed fruits and vegetables | 1981 | NaN |
| 841 | Processed fruits and vegetables | 1982 | NaN |
| 842 | Processed fruits and vegetables | 1983 | NaN |
| 843 | Processed fruits and vegetables | 1984 | NaN |
| 844 | Processed fruits and vegetables | 1985 | NaN |
| 845 | Processed fruits and vegetables | 1986 | NaN |
| 846 | Processed fruits and vegetables | 1987 | NaN |
| 847 | Processed fruits and vegetables | 1988 | NaN |
| 848 | Processed fruits and vegetables | 1989 | NaN |
| 849 | Processed fruits and vegetables | 1990 | NaN |
| 850 | Processed fruits and vegetables | 1991 | NaN |
| 851 | Processed fruits and vegetables | 1992 | NaN |
| 852 | Processed fruits and vegetables | 1993 | NaN |
| 853 | Processed fruits and vegetables | 1994 | NaN |
| 854 | Processed fruits and vegetables | 1995 | NaN |
| 855 | Processed fruits and vegetables | 1996 | NaN |
| 856 | Processed fruits and vegetables | 1997 | NaN |
| 857 | Processed fruits and vegetables | 1998 | NaN |
Step 10
Similar to before, we want a look at what the data represents. We first make sure the Year column is an integer, which will help us remove entries outside of our time range. We also see all of the food groups and the range of years covered.
We will disregard all entries from before 2010. The reason we choose 2010 instead of 2011 is that we will use the 2010 values as a baseline for analysis, which helps show the effects of CPI changes from 2011 onward.
print(cpihist['Year'].dtype)
for column in cpihist.columns:
    print(cpihist[column].unique())
int64 ['All food' 'Food away from home' 'Food at home' 'Meats, poultry, and fish' 'Meats' 'Beef and veal' 'Pork' 'Other meats' 'Poultry' 'Fish and seafood' 'Eggs' 'Dairy products' 'Fats and oils' 'Fruits and vegetables' 'Fresh fruits and vegetables' 'Fresh fruits' 'Fresh vegetables' 'Processed fruits and vegetables' 'Sugar and sweets' 'Cereals and bakery products' 'Nonalcoholic beverages' 'Other foods'] [1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022] [ 14.3 8.5 3. 6.3 9.9 11. 8.6 7.8 4.1 2.1 3.8 2.3 3.2 5.8 2.9 1.2 2.2 2.4 2.8 3.3 2.6 1.8 3.4 4. 5.5 0.8 3.7 1.4 1.9 0.3 0.9 3.9 12.7 9.4 6.8 7.6 9.1 11.1 9. 5.4 4.4 4.2 4.6 4.7 2. 1.7 2.5 3.1 3.6 3.5 1.3 4.5 7.7 14.9 8.2 5.9 10.5 10.8 8.1 7.2 1. 1.5 4.3 6.5 0.7 6.4 0.5 4.8 -1.3 -0.2 0.4 11.4 -0.6 16.7 -0.7 -0.4 5. 7.3 -0.8 7.4 -3.5 -0.1 9.6 0.2 -2.3 18.6 17. 4.9 -1.2 -0.9 7.5 10.1 -1.4 0.6 0.1 -1.9 5.7 8.4 8.8 9.2 -4.4 -3.2 22.9 27.4 -1.5 -2.1 8. -0.3 11.5 -1. 10.2 12.1 -6.3 1.6 9.3 5.3 -0.5 22.4 -5.4 12.9 -3.3 12.8 -3. 14.7 -4.7 5.2 -1.8 5.6 -2. -3.9 -4.1 8.7 17.7 14.2 -5.1 10.4 -1.7 10.6 6.2 5.1 -2.7 14.6 15.3 11.7 9.7 8.3 1.1 -1.6 7.1 9.5 -16.6 -5.9 26.6 -10.6 -2.4 17.9 13.8 -13.7 29.2 14. -14.7 17.8 -21.1 -9.5 -10. 32.2 2.7 11.6 9.8 6.6 -1.1 7. -6.4 12. 41.9 10.7 -12.5 7.9 6.7 -2.2 6.1 18.5 16.5 8.9 13.6 10.9 12.2 10.3 -4.8 19.5 12.4 -4.2 11.2 13.5 -5. 6. -6.1 12.6 -4.3 -3.4 0. nan 52.4 26.1 -11.3 12.3 23. 29.9 11.3 10. 13. 19.3 15. 19.6 50.6 -2.6 6.9 21.4]
cpihist = cpihist[cpihist['Year'] >= 2010]
cpihist.head()
| | Consumer Price Index item | Year | Percent change |
|---|---|---|---|
| 36 | All food | 2010 | 0.8 |
| 37 | All food | 2011 | 3.7 |
| 38 | All food | 2012 | 2.6 |
| 39 | All food | 2013 | 1.4 |
| 40 | All food | 2014 | 2.4 |
Step 11
Our next step is to add a price index column to the dataframe. We will set all values in 2010 to a baseline of 100. For each subsequent year and food item, we will apply the formula:

$$\text{Index}_t = \text{Index}_{t-1} \times \left(1 + \frac{\text{Percent change}_t}{100}\right)$$

This compounds each previous year's percent change into the current year's index. For example, All food goes from 100 in 2010 to 100 × 1.037 = 103.7 in 2011, then to 103.7 × 1.026 ≈ 106.40 in 2012, matching the output below. Because this dataframe is rather short, we will also print out all values for each food item, to take a quick glance over the data and ensure its correctness.
def create_index(item):
    index_values = [100]
    #for each later year, applies the above formula to extend the price index
    for r in range(1, len(item)):
        index_values.append(index_values[-1] * (1 + item['Percent change'].iloc[r] / 100))
    item['Price Index'] = index_values
    return item
cpi_index = cpihist.groupby('Consumer Price Index item').apply(create_index).reset_index(drop=True)
cpi_index.head()
/var/folders/fw/m7c8y6mj7qq2w3wxy4ynlxph0000gn/T/ipykernel_98603/2485733993.py:10: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
cpi_index = cpihist.groupby('Consumer Price Index item').apply(create_index).reset_index(drop=True)
| | Consumer Price Index item | Year | Percent change | Price Index |
|---|---|---|---|---|
| 0 | All food | 2010 | 0.8 | 100.000000 |
| 1 | All food | 2011 | 3.7 | 103.700000 |
| 2 | All food | 2012 | 2.6 | 106.396200 |
| 3 | All food | 2013 | 1.4 | 107.885747 |
| 4 | All food | 2014 | 2.4 | 110.475005 |
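The deprecation warning above is harmless here, but it comes from apply operating on the grouping column. As a hedged alternative sketch, the same index can be built with groupby().transform over just the Percent change column; index_from_changes and cpi_index_alt are our own illustrative names, and this matches create_index as long as rows are sorted by year within each food item, as they are in this dataframe.
#cumulative product of the yearly growth factors, rescaled so the baseline year equals 100
def index_from_changes(s):
    growth = (1 + s / 100).cumprod()
    return 100 * growth / growth.iloc[0]
cpi_index_alt = cpihist.copy()
cpi_index_alt['Price Index'] = cpi_index_alt.groupby('Consumer Price Index item')['Percent change'].transform(index_from_changes)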
for group in cpi_index.groupby('Consumer Price Index item'):
    print(group)
('All food', Consumer Price Index item Year Percent change Price Index
0 All food 2010 0.8 100.000000
1 All food 2011 3.7 103.700000
2 All food 2012 2.6 106.396200
3 All food 2013 1.4 107.885747
4 All food 2014 2.4 110.475005
5 All food 2015 1.9 112.574030
6 All food 2016 0.3 112.911752
7 All food 2017 0.9 113.927958
8 All food 2018 1.4 115.522949
9 All food 2019 1.9 117.717885
10 All food 2020 3.4 121.720293
11 All food 2021 3.9 126.467385
12 All food 2022 9.9 138.987656)
('Beef and veal', Consumer Price Index item Year Percent change Price Index
13 Beef and veal 2010 2.9 100.000000
14 Beef and veal 2011 10.2 110.200000
15 Beef and veal 2012 6.4 117.252800
16 Beef and veal 2013 2.0 119.597856
17 Beef and veal 2014 12.1 134.069197
18 Beef and veal 2015 7.2 143.722179
19 Beef and veal 2016 -6.3 134.667681
20 Beef and veal 2017 -1.2 133.051669
21 Beef and veal 2018 1.4 134.914393
22 Beef and veal 2019 1.6 137.073023
23 Beef and veal 2020 9.6 150.232033
24 Beef and veal 2021 9.3 164.203612
25 Beef and veal 2022 5.3 172.906404)
('Cereals and bakery products', Consumer Price Index item Year Percent change Price Index
26 Cereals and bakery products 2010 -0.8 100.000000
27 Cereals and bakery products 2011 3.9 103.900000
28 Cereals and bakery products 2012 2.8 106.809200
29 Cereals and bakery products 2013 1.0 107.877292
30 Cereals and bakery products 2014 0.2 108.093047
31 Cereals and bakery products 2015 1.1 109.282070
32 Cereals and bakery products 2016 -0.3 108.954224
33 Cereals and bakery products 2017 -0.5 108.409453
34 Cereals and bakery products 2018 0.4 108.843091
35 Cereals and bakery products 2019 1.4 110.366894
36 Cereals and bakery products 2020 2.2 112.794966
37 Cereals and bakery products 2021 2.3 115.389250
38 Cereals and bakery products 2022 13.0 130.389852)
('Dairy products', Consumer Price Index item Year Percent change Price Index
39 Dairy products 2010 1.1 100.000000
40 Dairy products 2011 6.8 106.800000
41 Dairy products 2012 2.1 109.042800
42 Dairy products 2013 0.1 109.151843
43 Dairy products 2014 3.6 113.081309
44 Dairy products 2015 -1.3 111.611252
45 Dairy products 2016 -2.3 109.044193
46 Dairy products 2017 0.1 109.153238
47 Dairy products 2018 -0.5 108.607471
48 Dairy products 2019 1.0 109.693546
49 Dairy products 2020 4.4 114.520062
50 Dairy products 2021 1.4 116.123343
51 Dairy products 2022 12.0 130.058144)
('Eggs', Consumer Price Index item Year Percent change Price Index
52 Eggs 2010 1.5 100.000000
53 Eggs 2011 9.2 109.200000
54 Eggs 2012 3.2 112.694400
55 Eggs 2013 3.3 116.413315
56 Eggs 2014 8.4 126.192034
57 Eggs 2015 17.8 148.654216
58 Eggs 2016 -21.1 117.288176
59 Eggs 2017 -9.5 106.145799
60 Eggs 2018 10.8 117.609546
61 Eggs 2019 -10.0 105.848591
62 Eggs 2020 4.3 110.400081
63 Eggs 2021 4.5 115.368084
64 Eggs 2022 32.2 152.516607)
('Fats and oils', Consumer Price Index item Year Percent change Price Index
65 Fats and oils 2010 -0.3 100.000000
66 Fats and oils 2011 9.3 109.300000
67 Fats and oils 2012 6.1 115.967300
68 Fats and oils 2013 -1.4 114.343758
69 Fats and oils 2014 0.1 114.458102
70 Fats and oils 2015 -1.0 113.313521
71 Fats and oils 2016 -0.6 112.633639
72 Fats and oils 2017 0.8 113.534709
73 Fats and oils 2018 0.1 113.648243
74 Fats and oils 2019 -0.7 112.852706
75 Fats and oils 2020 1.3 114.319791
76 Fats and oils 2021 4.6 119.578501
77 Fats and oils 2022 18.5 141.700524)
('Fish and seafood', Consumer Price Index item Year Percent change Price Index
78 Fish and seafood 2010 1.1 100.000000
79 Fish and seafood 2011 7.1 107.100000
80 Fish and seafood 2012 2.4 109.670400
81 Fish and seafood 2013 2.5 112.412160
82 Fish and seafood 2014 5.8 118.932065
83 Fish and seafood 2015 -0.9 117.861677
84 Fish and seafood 2016 -0.7 117.036645
85 Fish and seafood 2017 1.2 118.441085
86 Fish and seafood 2018 2.1 120.928347
87 Fish and seafood 2019 1.6 122.863201
88 Fish and seafood 2020 3.3 126.917687
89 Fish and seafood 2021 5.4 133.771242
90 Fish and seafood 2022 9.1 145.944425)
('Food at home', Consumer Price Index item Year Percent change Price Index
91 Food at home 2010 0.3 100.000000
92 Food at home 2011 4.8 104.800000
93 Food at home 2012 2.5 107.420000
94 Food at home 2013 0.9 108.386780
95 Food at home 2014 2.4 110.988063
96 Food at home 2015 1.2 112.319919
97 Food at home 2016 -1.3 110.859761
98 Food at home 2017 -0.2 110.638041
99 Food at home 2018 0.4 111.080593
100 Food at home 2019 0.9 112.080319
101 Food at home 2020 3.5 116.003130
102 Food at home 2021 3.5 120.063239
103 Food at home 2022 11.4 133.750448)
('Food away from home', Consumer Price Index item Year Percent change Price Index
104 Food away from home 2010 1.3 100.000000
105 Food away from home 2011 2.3 102.300000
106 Food away from home 2012 2.8 105.164400
107 Food away from home 2013 2.1 107.372852
108 Food away from home 2014 2.4 109.949801
109 Food away from home 2015 2.9 113.138345
110 Food away from home 2016 2.6 116.079942
111 Food away from home 2017 2.3 118.749781
112 Food away from home 2018 2.6 121.837275
113 Food away from home 2019 3.1 125.614231
114 Food away from home 2020 3.4 129.885114
115 Food away from home 2021 4.5 135.729945
116 Food away from home 2022 7.7 146.181150)
('Fresh fruits', Consumer Price Index item Year Percent change Price Index
117 Fresh fruits 2010 -0.6 100.000000
118 Fresh fruits 2011 3.3 103.300000
119 Fresh fruits 2012 1.0 104.333000
120 Fresh fruits 2013 2.0 106.419660
121 Fresh fruits 2014 4.8 111.527804
122 Fresh fruits 2015 -2.2 109.074192
123 Fresh fruits 2016 2.2 111.473824
124 Fresh fruits 2017 0.5 112.031193
125 Fresh fruits 2018 1.0 113.151505
126 Fresh fruits 2019 -1.4 111.567384
127 Fresh fruits 2020 -0.8 110.674845
128 Fresh fruits 2021 5.5 116.761962
129 Fresh fruits 2022 7.9 125.986157)
('Fresh fruits and vegetables', Consumer Price Index item Year Percent change Price Index
130 Fresh fruits and vegetables 2010 0.7 100.000000
131 Fresh fruits and vegetables 2011 4.5 104.500000
132 Fresh fruits and vegetables 2012 -2.0 102.410000
133 Fresh fruits and vegetables 2013 3.3 105.789530
134 Fresh fruits and vegetables 2014 1.9 107.799531
135 Fresh fruits and vegetables 2015 -0.5 107.260533
136 Fresh fruits and vegetables 2016 1.2 108.547660
137 Fresh fruits and vegetables 2017 0.2 108.764755
138 Fresh fruits and vegetables 2018 1.1 109.961167
139 Fresh fruits and vegetables 2019 1.0 111.060779
140 Fresh fruits and vegetables 2020 0.8 111.949265
141 Fresh fruits and vegetables 2021 3.3 115.643591
142 Fresh fruits and vegetables 2022 7.5 124.316860)
('Fresh vegetables', Consumer Price Index item Year Percent change Price Index
143 Fresh vegetables 2010 2.0 100.000000
144 Fresh vegetables 2011 5.6 105.600000
145 Fresh vegetables 2012 -5.1 100.214400
146 Fresh vegetables 2013 4.7 104.924477
147 Fresh vegetables 2014 -1.3 103.560459
148 Fresh vegetables 2015 1.6 105.217426
149 Fresh vegetables 2016 0.0 105.217426
150 Fresh vegetables 2017 -0.1 105.112209
151 Fresh vegetables 2018 1.1 106.268443
152 Fresh vegetables 2019 3.8 110.306644
153 Fresh vegetables 2020 2.6 113.174616
154 Fresh vegetables 2021 1.1 114.419537
155 Fresh vegetables 2022 7.0 122.428905)
('Fruits and vegetables', Consumer Price Index item Year Percent change Price Index
156 Fruits and vegetables 2010 0.2 100.000000
157 Fruits and vegetables 2011 4.1 104.100000
158 Fruits and vegetables 2012 -0.6 103.475400
159 Fruits and vegetables 2013 2.5 106.062285
160 Fruits and vegetables 2014 1.5 107.653219
161 Fruits and vegetables 2015 -0.2 107.437913
162 Fruits and vegetables 2016 0.8 108.297416
163 Fruits and vegetables 2017 -0.2 108.080821
164 Fruits and vegetables 2018 0.7 108.837387
165 Fruits and vegetables 2019 1.0 109.925761
166 Fruits and vegetables 2020 1.4 111.464722
167 Fruits and vegetables 2021 3.2 115.031593
168 Fruits and vegetables 2022 8.5 124.809278)
('Meats', Consumer Price Index item Year Percent change Price Index
169 Meats 2010 2.8 100.000000
170 Meats 2011 8.8 108.800000
171 Meats 2012 3.4 112.499200
172 Meats 2013 1.2 113.849190
173 Meats 2014 9.2 124.323316
174 Meats 2015 3.0 128.053015
175 Meats 2016 -4.4 122.418683
176 Meats 2017 -0.6 121.684171
177 Meats 2018 0.4 122.170907
178 Meats 2019 1.3 123.759129
179 Meats 2020 7.4 132.917305
180 Meats 2021 7.7 143.151937
181 Meats 2022 8.2 154.890396)
('Meats, poultry, and fish', Consumer Price Index item Year Percent change Price Index
182 Meats, poultry, and fish 2010 1.9 100.000000
183 Meats, poultry, and fish 2011 7.4 107.400000
184 Meats, poultry, and fish 2012 3.6 111.266400
185 Meats, poultry, and fish 2013 2.1 113.602994
186 Meats, poultry, and fish 2014 7.2 121.782410
187 Meats, poultry, and fish 2015 1.9 124.096276
188 Meats, poultry, and fish 2016 -3.5 119.752906
189 Meats, poultry, and fish 2017 -0.1 119.633153
190 Meats, poultry, and fish 2018 0.7 120.470585
191 Meats, poultry, and fish 2019 1.0 121.675291
192 Meats, poultry, and fish 2020 6.3 129.340834
193 Meats, poultry, and fish 2021 6.8 138.136011
194 Meats, poultry, and fish 2022 9.6 151.397068)
('Nonalcoholic beverages', Consumer Price Index item Year Percent change Price Index
195 Nonalcoholic beverages 2010 -0.9 100.000000
196 Nonalcoholic beverages 2011 3.2 103.200000
197 Nonalcoholic beverages 2012 1.1 104.335200
198 Nonalcoholic beverages 2013 -1.0 103.291848
199 Nonalcoholic beverages 2014 -0.5 102.775389
200 Nonalcoholic beverages 2015 1.1 103.905918
201 Nonalcoholic beverages 2016 -0.4 103.490294
202 Nonalcoholic beverages 2017 0.2 103.697275
203 Nonalcoholic beverages 2018 0.0 103.697275
204 Nonalcoholic beverages 2019 1.9 105.667523
205 Nonalcoholic beverages 2020 3.6 109.471554
206 Nonalcoholic beverages 2021 2.8 112.536758
207 Nonalcoholic beverages 2022 11.0 124.915801)
('Other foods', Consumer Price Index item Year Percent change Price Index
208 Other foods 2010 -0.5 100.000000
209 Other foods 2011 2.3 102.300000
210 Other foods 2012 3.5 105.880500
211 Other foods 2013 0.5 106.409902
212 Other foods 2014 1.0 107.474002
213 Other foods 2015 1.6 109.193586
214 Other foods 2016 0.3 109.521166
215 Other foods 2017 0.1 109.630687
216 Other foods 2018 0.1 109.740318
217 Other foods 2019 0.3 110.069539
218 Other foods 2020 3.1 113.481695
219 Other foods 2021 2.2 115.978292
220 Other foods 2022 12.7 130.707535)
('Other meats', Consumer Price Index item Year Percent change Price Index
221 Other meats 2010 -0.1 100.000000
222 Other meats 2011 6.4 106.400000
223 Other meats 2012 1.7 108.208800
224 Other meats 2013 -0.1 108.100591
225 Other meats 2014 3.9 112.316514
226 Other meats 2015 4.1 116.921491
227 Other meats 2016 -0.9 115.869198
228 Other meats 2017 -0.9 114.826375
229 Other meats 2018 -0.4 114.367070
230 Other meats 2019 0.9 115.396373
231 Other meats 2020 4.4 120.473814
232 Other meats 2021 2.9 123.967554
233 Other meats 2022 14.2 141.570947)
('Pork', Consumer Price Index item Year Percent change Price Index
234 Pork 2010 4.7 100.000000
235 Pork 2011 8.5 108.500000
236 Pork 2012 0.3 108.825500
237 Pork 2013 0.9 109.804929
238 Pork 2014 9.1 119.797178
239 Pork 2015 -3.9 115.125088
240 Pork 2016 -4.1 110.404960
241 Pork 2017 0.6 111.067389
242 Pork 2018 -0.4 110.623120
243 Pork 2019 1.2 111.950597
244 Pork 2020 6.3 119.003485
245 Pork 2021 8.6 129.237784
246 Pork 2022 8.7 140.481472)
('Poultry', Consumer Price Index item Year Percent change Price Index
247 Poultry 2010 -0.1 100.000000
248 Poultry 2011 2.9 102.900000
249 Poultry 2012 5.5 108.559500
250 Poultry 2013 4.7 113.661796
251 Poultry 2014 2.0 115.935032
252 Poultry 2015 0.4 116.398773
253 Poultry 2016 -2.7 113.256006
254 Poultry 2017 0.2 113.482518
255 Poultry 2018 0.3 113.822965
256 Poultry 2019 -0.3 113.481496
257 Poultry 2020 5.6 119.836460
258 Poultry 2021 5.1 125.948120
259 Poultry 2022 14.6 144.336545)
('Processed fruits and vegetables', Consumer Price Index item Year Percent change Price Index
260 Processed fruits and vegetables 2010 -1.3 100.000000
261 Processed fruits and vegetables 2011 2.9 102.900000
262 Processed fruits and vegetables 2012 3.8 106.810200
263 Processed fruits and vegetables 2013 0.3 107.130631
264 Processed fruits and vegetables 2014 0.1 107.237761
265 Processed fruits and vegetables 2015 0.7 107.988426
266 Processed fruits and vegetables 2016 -0.3 107.664460
267 Processed fruits and vegetables 2017 -1.6 105.941829
268 Processed fruits and vegetables 2018 -0.6 105.306178
269 Processed fruits and vegetables 2019 1.1 106.464546
270 Processed fruits and vegetables 2020 3.5 110.190805
271 Processed fruits and vegetables 2021 2.9 113.386338
272 Processed fruits and vegetables 2022 12.0 126.992699)
('Sugar and sweets', Consumer Price Index item Year Percent change Price Index
273 Sugar and sweets 2010 2.2 100.000000
274 Sugar and sweets 2011 3.3 103.300000
275 Sugar and sweets 2012 3.3 106.708900
276 Sugar and sweets 2013 -1.7 104.894849
277 Sugar and sweets 2014 -0.8 104.055690
278 Sugar and sweets 2015 3.2 107.385472
279 Sugar and sweets 2016 -0.4 106.955930
280 Sugar and sweets 2017 -0.1 106.848974
281 Sugar and sweets 2018 0.4 107.276370
282 Sugar and sweets 2019 2.0 109.421897
283 Sugar and sweets 2020 3.3 113.032820
284 Sugar and sweets 2021 3.0 116.423805
285 Sugar and sweets 2022 10.4 128.531880)
Data Visualization
To better understand our data, we are going to visualize it. We will primarily use line graphs and scatter plots to find visual relationships and trends between the items in our two dataframes. We will use matplotlib and seaborn for all of our visualizations.
Step 1
Our first graph plots the price index of each food group over time from our CPI data, to see how food prices have changed. This is useful because we can then ask whether the price of food has an impact on American adult health.
Looking at the resulting graph, we see that every food group has increased in price since 2010, with some, such as beef and veal, dairy, meats, and eggs, growing much more than others. It is also clear that since 2020 and the COVID-19 pandemic, food prices have risen faster than before, which is expected given the supply-chain issues of that period.
food_groups = cpi_index['Consumer Price Index item'].unique()
plt.figure(figsize=(15,15))
#we want to plot a line for each individual food group, we simply put the price index for each year
for food in food_groups:
cpi_food = cpi_index[cpi_index['Consumer Price Index item'] == food]
plt.plot(cpi_food['Year'], cpi_food['Price Index'], label=food)
plt.title('Indexed Price of CPI Food Groups from 2010 to 2022')
plt.xlabel('Year')
plt.ylabel('Indexed Price')
plt.legend()
plt.show()
Step 2
We now want to see how obesity, overweight, and inactivity rates have changed over time. To start, we will use the national totals for each response, just to get an overview of the national rates.
The graphs we get are interesting. The inactivity graph shows a pattern of repeated spikes and drops in the percentage, with a final drop in 2020, most likely due to COVID-19, that does not recover the way earlier drops did.
The obesity graph shows the expected rise in obesity rates over time. The overweight graph shows that rates have actually decreased. While this may seem like a good trend at first glance, it is more that overweight Americans are being outpaced by obese Americans: more Americans are becoming obese rather than overweight, which is why overweight rates fall while obesity rates rise.
#we create a reusable function to allow for easier reproduction of new graphs. we take in which dataframe,
#which question, which demographic to graph and which region to use
def graph_questions(df, question, strat, region):
plt.figure(figsize=(10,10))
#we filter the given dataframe down to only the rows matching the Question -> Stratification -> Region selection
sub1 = df[df['Question'] == question]
sub2 = sub1[sub1['Stratification1'] == strat]
fixed_df = sub2[sub2['Region'] == region]
x = fixed_df['Year']
y = fixed_df['Data_Value']
plt.plot(x, y)
plt.title(region + ' ' + question + ' with classifier: ' + strat)
plt.xlabel('Year')
plt.ylabel('Percent')
plt.show()
#graph for each question of interest
for q in nut_list:
graph_questions(reg_avgs, q, 'Total', 'National')
Step 3
We now want to look at how different groups have trended compared to the national average. Unfortunately, there are too many combinations to graph every possible one, so we will focus on groups determined to be at higher risk for these health problems. Looking at social and economic risk factors, individuals with lower incomes are at higher risk of obesity, as are Black and Hispanic adult Americans. We will focus only on obesity rates for these groups since, as we saw before, overweight classifications may not be reliable in our data.
Upon graphing the risk-group obesity rates across the nation, we see that all of them have had higher obesity rates over the same period than the national total average.
risk_groups = ['Less than $15,000', '$15,000 - $24,999', '$25,000 - $34,999', '$35,000 - $49,999', 'Non-Hispanic Black', 'Hispanic']
q = 'Percent of adults aged 18 years and older who have obesity'
for grp in risk_groups:
graph_questions(reg_avgs, q, grp, 'National')
Step 4
To really see the difference between the at-risk and not-at-risk groups mentioned before, we will graph their obesity trends on the same axes. To do this, we will create a new column in our dataframe that classifies whether a responder falls in an at-risk group or not.
Upon graphing the two trends, we see a large difference in obesity rates between the two groups: the at-risk group has as much as 5 percentage points higher obesity rates than the not-at-risk group.
#adding the risk column by checking our risk_groups list from before
reg_avgs['Risk'] = reg_avgs['Stratification1'].apply(lambda r: 'At Risk' if r in risk_groups else 'Not At Risk')
plt.figure(figsize=(10,10))
#finding the average rates among the risk and non-risk groups; using the groupby
#index as the x axis keeps years and values aligned
yrisk = reg_avgs[reg_avgs['Risk'] == 'At Risk'].groupby('Year')['Data_Value'].mean()
ynot = reg_avgs[reg_avgs['Risk'] == 'Not At Risk'].groupby('Year')['Data_Value'].mean()
plt.plot(yrisk.index, yrisk.values, label='At Risk Group')
plt.plot(ynot.index, ynot.values, label='Not At Risk Group')
plt.title('Average National Obesity Rates of Risk Group Compared to Not at Risk Group')
plt.xlabel('Year')
plt.ylabel('Percent')
plt.legend()
plt.show()
Step 5
To further our analysis, we are going to merge our two dataframes together. This will let us see whether relationships exist between food prices and obesity and inactivity rates. We will accomplish this with a simple merge on year, and for our first analysis we will compare national obesity rates to the indexed price of food. We will do this by creating scatter plots of obesity rates over the indexed prices for the groups we choose, again splitting between groups at risk and groups not at risk, and adding lines of best fit to see if linear trends exist.
For graphing we will use seaborn's regplot function, which creates a scatter plot between the x (price index) and y (obesity rate) variables, adds a line of best fit, and shades a confidence interval of possible ranges of that line.
merged = pd.merge(reg_avgs, cpi_index, on='Year', how='inner')
merged.head()
| Year | Question | Stratification1 | Region | Data_Value | Risk | Consumer Price Index item | Percent change | Price Index | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011 | Percent of adults aged 18 years and older who ... | $15,000 - $24,999 | Midwest | 34.018182 | At Risk | All food | 3.7 | 103.7 |
| 1 | 2011 | Percent of adults aged 18 years and older who ... | $15,000 - $24,999 | Midwest | 34.018182 | At Risk | Beef and veal | 10.2 | 110.2 |
| 2 | 2011 | Percent of adults aged 18 years and older who ... | $15,000 - $24,999 | Midwest | 34.018182 | At Risk | Cereals and bakery products | 3.9 | 103.9 |
| 3 | 2011 | Percent of adults aged 18 years and older who ... | $15,000 - $24,999 | Midwest | 34.018182 | At Risk | Dairy products | 6.8 | 106.8 |
| 4 | 2011 | Percent of adults aged 18 years and older who ... | $15,000 - $24,999 | Midwest | 34.018182 | At Risk | Eggs | 9.2 | 109.2 |
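Since we merge on Year alone, the join is many-to-many: every survey row for a year is paired with every CPI item for that year, so the merged frame is much longer than either input. A quick sanity check (a sketch) makes that explicit:
#the Year-only merge is many-to-many, so the merged frame is much longer than either input
print(len(reg_avgs), len(cpi_index), len(merged))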
#filter to only national rates and all-food prices
tot_and_nat = merged[(merged['Region'] == 'National') & (merged['Consumer Price Index item'] == 'All food') & (merged['Question'] == 'Percent of adults aged 18 years and older who have obesity')].groupby(['Year', 'Risk', 'Percent change', 'Price Index']).agg({
'Data_Value': 'mean'
}).reset_index()
plt.figure(figsize=(10, 10))
xrisk = tot_and_nat[tot_and_nat['Risk'] == 'At Risk']['Price Index']
yrisk = tot_and_nat[tot_and_nat['Risk'] == 'At Risk']['Data_Value']
xnot = tot_and_nat[tot_and_nat['Risk'] == 'Not At Risk']['Price Index']
ynot = tot_and_nat[tot_and_nat['Risk'] == 'Not At Risk']['Data_Value']
sns.regplot(x=xrisk, y=yrisk, label='At Risk')
sns.regplot(x=xnot, y=ynot, label='Not At Risk')
plt.title('National Obesity Rates of At Risk and Not At Risk Groups Compared to Indexed Price of Food')
plt.xlabel('Indexed Price of Food')
plt.ylabel('Obesity Percentage')
plt.legend()
plt.show()
Step 6
Looking at the all-food data, we do see higher obesity rates at higher indexed food prices, with the at-risk group sitting above the not-at-risk group.
To continue our analysis, we will look at the following foods of interest and create similar graphs: meats, fats and oils, fruits and vegetables, and sugar and sweets.
Looking at the graphs, we see similar trends: higher food prices accompany higher obesity rates. The exception is fats and oils, whose price changed much less than the others, so its graph shows the weakest trend of the food groups.
interest_food = ['Meats', 'Fats and oils', 'Fruits and vegetables', 'Sugar and sweets']
for food in interest_food:
df = merged[(merged['Region'] == 'National') & (merged['Consumer Price Index item'] == food) & (merged['Question'] == 'Percent of adults aged 18 years and older who have obesity')].groupby(['Year', 'Risk', 'Percent change', 'Price Index']).agg({
'Data_Value': 'mean'
}).reset_index()
plt.figure(figsize=(10, 10))
xrisk = df[df['Risk'] == 'At Risk']['Price Index']
yrisk = df[df['Risk'] == 'At Risk']['Data_Value']
xnot = df[df['Risk'] == 'Not At Risk']['Price Index']
ynot = df[df['Risk'] == 'Not At Risk']['Data_Value']
sns.regplot(x=xrisk, y=yrisk, label='At Risk')
sns.regplot(x=xnot, y=ynot, label='Not At Risk')
plt.title('National Obesity Rates of At Risk and Not At Risk Groups Compared to Indexed Price of ' + food)
plt.xlabel('Indexed Price of ' + food)
plt.ylabel('Obesity Percentage')
plt.legend()
plt.show()
Step 7
Our next step is to see if annual percent changes in food prices had any impact on obesity rates. To do this we follow a very similar process, simply replacing our x axis with the percent change in food price rather than the price index.
Upon graphing, we see that in years with higher percent increases, obesity rates are somewhat higher, though these graphs are less striking than the ones for the indexed price. One possible issue is the lagged effect of annual percent changes: if food prices change in a year, it may take individuals longer than that year to change their BMI classification, so a percent change in one year may have more of an impact in the next year than in the current one.
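One way to probe this lag, sketched below under the assumption that a one-year shift is the right horizon (it is not run here), is to pair each year's survey rates with the previous year's price change:
#hypothetical lag check: shift each food group's percent change forward one year before merging
cpi_lag = cpi_index.sort_values(['Consumer Price Index item', 'Year']).copy()
cpi_lag['Prior year change'] = cpi_lag.groupby('Consumer Price Index item')['Percent change'].shift(1)
merged_lag = pd.merge(reg_avgs, cpi_lag, on='Year', how='inner')
#then re-run the regplot loop below with x = 'Prior year change'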
for food in interest_food:
df = merged[(merged['Region'] == 'National') & (merged['Consumer Price Index item'] == food) & (merged['Question'] == 'Percent of adults aged 18 years and older who have obesity')].groupby(['Year', 'Risk', 'Percent change', 'Price Index']).agg({
'Data_Value': 'mean'
}).reset_index()
plt.figure(figsize=(10, 10))
xrisk = df[df['Risk'] == 'At Risk']['Percent change']
yrisk = df[df['Risk'] == 'At Risk']['Data_Value']
xnot = df[df['Risk'] == 'Not At Risk']['Percent change']
ynot = df[df['Risk'] == 'Not At Risk']['Data_Value']
sns.regplot(x=xrisk, y=yrisk, label='At Risk')
sns.regplot(x=xnot, y=ynot, label='Not At Risk')
plt.title('National Obesity Rates of At Risk and Not At Risk Groups Compared to Percent Change in Price of ' + food)
plt.xlabel('Percent Change in Price of ' + food)
plt.ylabel('Obesity Percentage')
plt.legend()
plt.show()
Step 8
Our next step is to reproduce these food-price comparisons for physical activity rates. One could expect that when food is more expensive in a year, individuals change their physical activity habits, either because purchasing food becomes easier or because access to food becomes restricted.
We follow a similar method as before, using seaborn's regplot function; the only difference is that we filter by the physical activity question instead.
Looking at the graphs, we see a different trend: a split emerges between at-risk and not-at-risk groups in their response. As prices increase, at-risk groups tend to increase their physical activity, while not-at-risk groups decrease their physical activity.
interest_food.append('All food')
for food in interest_food:
df = merged[(merged['Region'] == 'National') & (merged['Consumer Price Index item'] == food) & (merged['Question'] == 'Percent of adults who engage in no leisure-time physical activity')].groupby(['Year', 'Risk', 'Percent change', 'Price Index']).agg({
'Data_Value': 'mean'
}).reset_index()
plt.figure(figsize=(10, 10))
xrisk = df[df['Risk'] == 'At Risk']['Price Index']
yrisk = df[df['Risk'] == 'At Risk']['Data_Value']
xnot = df[df['Risk'] == 'Not At Risk']['Price Index']
ynot = df[df['Risk'] == 'Not At Risk']['Data_Value']
sns.regplot(x=xrisk, y=yrisk, label='At Risk')
sns.regplot(x=xnot, y=ynot, label='Not At Risk')
plt.title('National Rate of Adults Engaging in No Leisure-Time Physical Activity for At Risk and Not At Risk Groups Compared to Indexed Price of ' + food)
plt.xlabel('Indexed Price of ' + food)
plt.ylabel('Percent of Inactive Adults')
plt.legend()
plt.show()
Step 9
We are now going to extend our visual analysis to regional variation. We want to see if different regions of the U.S. show different obesity-rate and inactivity responses to food prices.
We follow a very similar method as before; the difference is that we now iterate through the regions list and filter by each region.
Looking at the graphs produced for obesity, we see that geographical differences do exist. The main contrast is between the West and Southeast regions: the West shows a weaker obesity response to food prices, while the Southeast shows a much stronger one.
For the inactivity graphs, we see a similar geographical divide. The gap between at-risk and not-at-risk groups exists as seen before, but it is much more pronounced in the Southeast and Northeast regions, while in the West the effect is weaker.
for region in set(regions.values()):
df = merged[(merged['Region'] == region) & (merged['Consumer Price Index item'] == 'All food') & (merged['Question'] == 'Percent of adults aged 18 years and older who have obesity')].groupby(['Year', 'Risk', 'Percent change', 'Price Index']).agg({
'Data_Value': 'mean'
}).reset_index()
plt.figure(figsize=(10, 10))
xrisk = df[df['Risk'] == 'At Risk']['Price Index']
yrisk = df[df['Risk'] == 'At Risk']['Data_Value']
xnot = df[df['Risk'] == 'Not At Risk']['Price Index']
ynot = df[df['Risk'] == 'Not At Risk']['Data_Value']
sns.regplot(x=xrisk, y=yrisk, label='At Risk')
sns.regplot(x=xnot, y=ynot, label='Not At Risk')
plt.title(region + ' Obesity Rates of At Risk and Not At Risk Groups Compared to Indexed Price of Food')
plt.xlabel('Indexed Price of Food')
plt.ylabel('Obesity Percentage')
plt.legend()
plt.show()
for region in set(regions.values()):
df = merged[(merged['Region'] == region) & (merged['Consumer Price Index item'] == 'All food') & (merged['Question'] == 'Percent of adults who engage in no leisure-time physical activity')].groupby(['Year', 'Risk', 'Percent change', 'Price Index']).agg({
'Data_Value': 'mean'
}).reset_index()
plt.figure(figsize=(10, 10))
xrisk = df[df['Risk'] == 'At Risk']['Price Index']
yrisk = df[df['Risk'] == 'At Risk']['Data_Value']
xnot = df[df['Risk'] == 'Not At Risk']['Price Index']
ynot = df[df['Risk'] == 'Not At Risk']['Data_Value']
sns.regplot(x=xrisk, y=yrisk, label='At Risk')
sns.regplot(x=xnot, y=ynot, label='Not At Risk')
plt.title(region + ' Rate of Adults Engaging in No Leisure-Time Physical Activity for At Risk and Not At Risk Groups Compared to Indexed Price of Food')
plt.xlabel('Indexed Price of Food')
plt.ylabel('Percent of Inactive Adults')
plt.legend()
plt.show()
Step 10
We will now look at the trends specifically for the at-risk groups mentioned before. To do this we continue with our filtered dataframe, but we graph only one set of points: the percent response rate over the price index of food.
Looking at the graphs produced for the obesity question, we see that all groups show a similarly pronounced increase in obesity rates as the price index of food rises, more pronounced than in the national graph produced earlier in the page.
Looking at the graphs produced for the inactivity question, we see quite different trends. For the income groups, it is noticeable that the response of inactivity rates to rising food prices varies by income, suggesting that as income increases, activity rates decrease when food prices go up. For Black and Hispanic adults we see a new pattern: Black adults become much more inactive as prices increase, while Hispanic adults become only slightly more inactive.
for group in risk_groups:
df = merged[(merged['Region'] == 'National') & (merged['Consumer Price Index item'] == 'All food') & (merged['Question'] == 'Percent of adults aged 18 years and older who have obesity') & (merged['Stratification1'] == group)].groupby(['Year', 'Stratification1', 'Percent change', 'Price Index']).agg({
'Data_Value': 'mean'
}).reset_index()
plt.figure(figsize=(10, 10))
x = df['Price Index']
y = df['Data_Value']
sns.regplot(x=x, y=y)
plt.title('National Obesity Rates of Adults (' + group + ') Compared to Indexed Price of Food')
plt.xlabel('Indexed Price of Food')
plt.ylabel('Obesity Percentage')
plt.show()
for group in risk_groups:
df = merged[(merged['Region'] == 'National') & (merged['Consumer Price Index item'] == 'All food') & (merged['Question'] == 'Percent of adults who engage in no leisure-time physical activity') & (merged['Stratification1'] == group)].groupby(['Year', 'Stratification1', 'Percent change', 'Price Index']).agg({
'Data_Value': 'mean'
}).reset_index()
plt.figure(figsize=(10, 10))
x = df['Price Index']
y = df['Data_Value']
sns.regplot(x=x, y=y)
plt.title('National Rate of Adults (' + group + ') Engaging in No Leisure-Time Physical Activity Compared to Indexed Price of Food')
plt.xlabel('Indexed Price of Food')
plt.ylabel('Percent of Inactive Adults')
plt.show()
Data Analysis
We will now move on to more advanced methods of analyzing our data. We begin with a linear regression model, to see if our socioeconomic and food-price factors have linear relationships with the obesity and inactivity rates of adults.
Our goal is to find statistically significant factors that put individuals at risk for higher obesity rates.
Step 1
To begin, we are going to reduce our dataframe to our columns of interest for the model. We will then split the merged dataframe into two: one including only obesity response rates, and the other only inactivity response rates.
We will begin with the obesity linear model, looking at every variable to see its total effect: year, stratification group, risk group, region, price index of food, and percent change.
Looking at the results of this first model, we see some glaring issues. Some coefficients are extreme, and the condition number is enormous. This means the model is not a simple unbiased linear regression; there are large amounts of covariance and multicollinearity, meaning our independent variables affect each other before impacting the actual obesity percentage. We will need to change our model.
merged.columns = ['Year', 'Question', 'Strat', 'Region', 'Data_Value', 'Risk', 'CPI_item', 'Percent_Change', 'Price_Index']
obese = merged[merged['Question'] == 'Percent of adults aged 18 years and older who have obesity']
inactive = merged[merged['Question'] == 'Percent of adults who engage in no leisure-time physical activity']
linmodel = sm.ols(formula='Data_Value ~ Year + Strat + Risk + Region + Price_Index + Percent_Change', data=obese).fit()
print(linmodel.summary2())
Results: Ordinary least squares
====================================================================================================
Model: OLS Adj. R-squared: 0.901
Dependent Variable: Data_Value AIC: 179850.4568
Date: 2024-05-18 18:00 BIC: 180152.7134
No. Observations: 41602 Log-Likelihood: -89890.
Df Model: 34 F-statistic: 1.118e+04
Df Residuals: 41567 Prob (F-statistic): 0.00
R-squared: 0.901 Scale: 4.4123
----------------------------------------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------------
Intercept -1179.6479 7.4115 -159.1644 0.0000 -1194.1746 -1165.1211
Strat[T.$25,000 - $34,999] -1.6703 0.0746 -22.3780 0.0000 -1.8166 -1.5240
Strat[T.$35,000 - $49,999] -2.1223 0.0746 -28.4344 0.0000 -2.2686 -1.9760
Strat[T.$50,000 - $74,999] 1.4619 0.0517 28.2943 0.0000 1.3606 1.5632
Strat[T.$75,000 or greater] -2.4163 0.0517 -46.7654 0.0000 -2.5176 -2.3150
Strat[T.18 - 24] -12.1254 0.0517 -234.6788 0.0000 -12.2267 -12.0241
Strat[T.2 or more races] 1.7661 0.0517 34.1813 0.0000 1.6648 1.8674
Strat[T.25 - 34] -1.0108 0.0517 -19.5624 0.0000 -1.1120 -0.9095
Strat[T.35 - 44] 3.8987 0.0517 75.4569 0.0000 3.7974 4.0000
Strat[T.45 - 54] 5.9858 0.0517 115.8514 0.0000 5.8846 6.0871
Strat[T.55 - 64] 4.8285 0.0517 93.4516 0.0000 4.7272 4.9297
Strat[T.65 or older] -2.2341 0.0517 -43.2387 0.0000 -2.3353 -2.1328
Strat[T.American Indian/Alaska Native] 6.6008 0.0517 127.7531 0.0000 6.4995 6.7020
Strat[T.Asian] -19.2597 0.0517 -372.7568 0.0000 -19.3609 -19.1584
Strat[T.College graduate] -5.8086 0.0517 -112.4204 0.0000 -5.9098 -5.7073
Strat[T.Female] 0.2193 0.0517 4.2438 0.0000 0.1180 0.3205
Strat[T.Hawaiian/Pacific Islander] 9.1824 0.0838 109.5361 0.0000 9.0181 9.3467
Strat[T.High school graduate] 2.3674 0.0517 45.8202 0.0000 2.2662 2.4687
Strat[T.Hispanic] -1.4084 0.0746 -18.8686 0.0000 -1.5546 -1.2621
Strat[T.Less than $15,000] 1.3337 0.0746 17.8690 0.0000 1.1874 1.4800
Strat[T.Less than high school] 4.4957 0.0517 87.0104 0.0000 4.3944 4.5969
Strat[T.Male] 0.1802 0.0517 3.4885 0.0005 0.0790 0.2815
Strat[T.Non-Hispanic Black] 3.6842 0.0746 49.3590 0.0000 3.5379 3.8305
Strat[T.Non-Hispanic White] -1.3819 0.0517 -26.7458 0.0000 -1.4832 -1.2806
Strat[T.Other] -2.9152 0.0547 -53.3133 0.0000 -3.0224 -2.8081
Strat[T.Some college or technical school] 1.9197 0.0517 37.1550 0.0000 1.8185 2.0210
Strat[T.Total] 0.2071 0.0517 4.0088 0.0001 0.1059 0.3084
Risk[T.Not At Risk] -4.0382 0.0517 -78.1566 0.0000 -4.1395 -3.9369
Region[T.National] -2.2397 0.0358 -62.5842 0.0000 -2.3099 -2.1696
Region[T.Northeast] -3.7239 0.0359 -103.6718 0.0000 -3.7943 -3.6535
Region[T.Southeast] 0.1112 0.0359 3.0926 0.0020 0.0407 0.1816
Region[T.Southwest] -0.8441 0.0360 -23.4656 0.0000 -0.9146 -0.7736
Region[T.West] -4.4237 0.0358 -123.6098 0.0000 -4.4938 -4.3535
Year 0.6030 0.0037 162.2880 0.0000 0.5957 0.6103
Price_Index -0.0023 0.0013 -1.8333 0.0668 -0.0049 0.0002
Percent_Change 0.0202 0.0027 7.4551 0.0000 0.0149 0.0256
----------------------------------------------------------------------------------------------------
Omnibus: 9351.181 Durbin-Watson: 0.097
Prob(Omnibus): 0.000 Jarque-Bera (JB): 456937.672
Skew: 0.128 Prob(JB): 0.000
Kurtosis: 19.234 Condition No.: 32869954963714204
====================================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.57e-22. This might indicate that there are strong
multicollinearity problems or that the design matrix is singular.
Step 2
We will look at the correlations between our numerical variables. We see that year has a high correlation with every other variable, meaning it could be affecting things other than the obesity rate. We also see a high correlation between price_index and percent_change, which is expected, since one is constructed from the other.
To fix our model, we are going to remove year from the equation, and separate price index from percent change.
Looking at the new model with only price index, the model becomes much more realistic, with the condition number coming down to 3156. That is still not ideal, but it is workable. Almost every variable is significant, save for the 35-44 age group and the Southeast region. We see that the Asian and 18-24 groups strongly impact the obesity rate, having the largest decreases in rates.
The model with percent change is even more manageable, with a condition number of 145. It follows similar trends to the price index model, but suggests that percent change is less statistically significant than price index, with a lower t value.
obese[['Year', 'Data_Value', 'Percent_Change', 'Price_Index']].corr()
| Year | Data_Value | Percent_Change | Price_Index | |
|---|---|---|---|---|
| Year | 1.000000 | 0.313128 | 0.218156 | 0.582468 |
| Data_Value | 0.313128 | 1.000000 | 0.081219 | 0.186722 |
| Percent_Change | 0.218156 | 0.081219 | 1.000000 | 0.538754 |
| Price_Index | 0.582468 | 0.186722 | 0.538754 | 1.000000 |
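For a more direct multicollinearity check, one could also compute variance inflation factors over the numeric predictors; a minimal sketch, assuming the obese dataframe from above:
#variance inflation factors for the numeric predictors (a rough diagnostic sketch)
from statsmodels.stats.outliers_influence import variance_inflation_factor
X = statsmodels.api.add_constant(obese[['Year', 'Percent_Change', 'Price_Index']])
vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)
print(vifs)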
olinmodel = sm.ols(formula='Data_Value ~ Strat + Region + Price_Index', data=obese).fit()
print(olinmodel.summary2())
Results: Ordinary least squares
==============================================================================================
Model: OLS Adj. R-squared: 0.838
Dependent Variable: Data_Value AIC: 200409.3330
Date: 2024-05-18 18:07 BIC: 200694.3178
No. Observations: 41602 Log-Likelihood: -1.0017e+05
Df Model: 32 F-statistic: 6737.
Df Residuals: 41569 Prob (F-statistic): 0.00
R-squared: 0.838 Scale: 7.2328
----------------------------------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
Intercept 23.8564 0.1508 158.1780 0.0000 23.5607 24.1520
Strat[T.$25,000 - $34,999] -1.6703 0.0956 -17.4784 0.0000 -1.8576 -1.4830
Strat[T.$35,000 - $49,999] -2.1223 0.0956 -22.2087 0.0000 -2.3096 -1.9350
Strat[T.$50,000 - $74,999] -2.5763 0.0956 -26.9590 0.0000 -2.7636 -2.3890
Strat[T.$75,000 or greater] -6.4545 0.0956 -67.5414 0.0000 -6.6418 -6.2672
Strat[T.18 - 24] -16.1636 0.0956 -169.1404 0.0000 -16.3509 -15.9763
Strat[T.2 or more races] -2.2721 0.0956 -23.7761 0.0000 -2.4594 -2.0848
Strat[T.25 - 34] -5.0490 0.0956 -52.8336 0.0000 -5.2363 -4.8617
Strat[T.35 - 44] -0.1395 0.0956 -1.4596 0.1444 -0.3268 0.0478
Strat[T.45 - 54] 1.9476 0.0956 20.3804 0.0000 1.7603 2.1349
Strat[T.55 - 64] 0.7903 0.0956 8.2695 0.0000 0.6030 0.9776
Strat[T.65 or older] -6.2723 0.0956 -65.6346 0.0000 -6.4596 -6.0850
Strat[T.American Indian/Alaska Native] 2.5626 0.0956 26.8153 0.0000 2.3753 2.7499
Strat[T.Asian] -23.2979 0.0956 -243.7948 0.0000 -23.4852 -23.1106
Strat[T.College graduate] -9.8468 0.0956 -103.0391 0.0000 -10.0341 -9.6595
Strat[T.Female] -3.8189 0.0956 -39.9624 0.0000 -4.0062 -3.6316
Strat[T.Hawaiian/Pacific Islander] 5.1237 0.1303 39.3077 0.0000 4.8682 5.3791
Strat[T.High school graduate] -1.6708 0.0956 -17.4833 0.0000 -1.8581 -1.4835
Strat[T.Hispanic] -1.4084 0.0956 -14.7374 0.0000 -1.5957 -1.2210
Strat[T.Less than $15,000] 1.3337 0.0956 13.9566 0.0000 1.1464 1.5210
Strat[T.Less than high school] 0.4575 0.0956 4.7869 0.0000 0.2702 0.6448
Strat[T.Male] -3.8580 0.0956 -40.3707 0.0000 -4.0453 -3.6707
Strat[T.Non-Hispanic Black] 3.6842 0.0956 38.5520 0.0000 3.4969 3.8715
Strat[T.Non-Hispanic White] -5.4201 0.0956 -56.7175 0.0000 -5.6074 -5.2328
Strat[T.Other] -7.0591 0.0985 -71.6405 0.0000 -7.2523 -6.8660
Strat[T.Some college or technical school] -2.1185 0.0956 -22.1683 0.0000 -2.3058 -1.9312
Strat[T.Total] -3.8311 0.0956 -40.0894 0.0000 -4.0184 -3.6438
Region[T.National] -2.2368 0.0458 -48.8181 0.0000 -2.3266 -2.1470
Region[T.Northeast] -3.7186 0.0460 -80.8570 0.0000 -3.8087 -3.6285
Region[T.Southeast] 0.1086 0.0460 2.3603 0.0183 0.0184 0.1988
Region[T.Southwest] -0.8420 0.0461 -18.2811 0.0000 -0.9323 -0.7517
Region[T.West] -4.4208 0.0458 -96.4823 0.0000 -4.5106 -4.3309
Price_Index 0.1063 0.0011 93.3431 0.0000 0.1041 0.1086
----------------------------------------------------------------------------------------------
Omnibus: 4811.513 Durbin-Watson: 0.269
Prob(Omnibus): 0.000 Jarque-Bera (JB): 49004.552
Skew: 0.070 Prob(JB): 0.000
Kurtosis: 8.315 Condition No.: 3156
==============================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
olinmodel = sm.ols(formula='Data_Value ~ Strat + Region + Percent_Change', data=obese).fit()
print(olinmodel.summary2())
Results: Ordinary least squares
==============================================================================================
Model: OLS Adj. R-squared: 0.811
Dependent Variable: Data_Value AIC: 206972.1435
Date: 2024-05-18 18:07 BIC: 207257.1283
No. Observations: 41602 Log-Likelihood: -1.0345e+05
Df Model: 32 F-statistic: 5564.
Df Residuals: 41569 Prob (F-statistic): 0.00
R-squared: 0.811 Scale: 8.4687
----------------------------------------------------------------------------------------------
Coef. Std.Err. t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
Intercept 35.8061 0.0804 445.4667 0.0000 35.6486 35.9637
Strat[T.$25,000 - $34,999] -1.6703 0.1034 -16.1527 0.0000 -1.8730 -1.4676
Strat[T.$35,000 - $49,999] -2.1223 0.1034 -20.5243 0.0000 -2.3250 -1.9197
Strat[T.$50,000 - $74,999] -2.5763 0.1034 -24.9143 0.0000 -2.7790 -2.3736
Strat[T.$75,000 or greater] -6.4545 0.1034 -62.4187 0.0000 -6.6572 -6.2518
Strat[T.18 - 24] -16.1636 0.1034 -156.3118 0.0000 -16.3663 -15.9609
Strat[T.2 or more races] -2.2721 0.1034 -21.9728 0.0000 -2.4748 -2.0694
Strat[T.25 - 34] -5.0490 0.1034 -48.8264 0.0000 -5.2516 -4.8463
Strat[T.35 - 44] -0.1395 0.1034 -1.3489 0.1774 -0.3422 0.0632
Strat[T.45 - 54] 1.9476 0.1034 18.8346 0.0000 1.7449 2.1503
Strat[T.55 - 64] 0.7903 0.1034 7.6423 0.0000 0.5876 0.9929
Strat[T.65 or older] -6.2723 0.1034 -60.6565 0.0000 -6.4749 -6.0696
Strat[T.American Indian/Alaska Native] 2.5626 0.1034 24.7815 0.0000 2.3599 2.7652
Strat[T.Asian] -23.2979 0.1034 -225.3040 0.0000 -23.5005 -23.0952
Strat[T.College graduate] -9.8468 0.1034 -95.2240 0.0000 -10.0494 -9.6441
Strat[T.Female] -3.8189 0.1034 -36.9314 0.0000 -4.0216 -3.6163
Strat[T.Hawaiian/Pacific Islander] 5.1153 0.1410 36.2664 0.0000 4.8388 5.3917
Strat[T.High school graduate] -1.6708 0.1034 -16.1573 0.0000 -1.8734 -1.4681
Strat[T.Hispanic] -1.4084 0.1034 -13.6196 0.0000 -1.6110 -1.2057
Strat[T.Less than $15,000] 1.3337 0.1034 12.8980 0.0000 1.1311 1.5364
Strat[T.Less than high school] 0.4575 0.1034 4.4239 0.0000 0.2548 0.6601
Strat[T.Male] -3.8580 0.1034 -37.3088 0.0000 -4.0606 -3.6553
Strat[T.Non-Hispanic Black] 3.6842 0.1034 35.6280 0.0000 3.4815 3.8868
Strat[T.Non-Hispanic White] -5.4201 0.1034 -52.4157 0.0000 -5.6228 -5.2174
Strat[T.Other] -7.1936 0.1066 -67.4722 0.0000 -7.4025 -6.9846
Strat[T.Some college or technical school] -2.1185 0.1034 -20.4869 0.0000 -2.3212 -1.9158
Strat[T.Total] -3.8311 0.1034 -37.0488 0.0000 -4.0338 -3.6284
Region[T.National] -2.2346 0.0496 -45.0717 0.0000 -2.3318 -2.1375
Region[T.Northeast] -3.7158 0.0498 -74.6681 0.0000 -3.8133 -3.6183
Region[T.Southeast] 0.1090 0.0498 2.1885 0.0286 0.0114 0.2066
Region[T.Southwest] -0.8401 0.0498 -16.8570 0.0000 -0.9378 -0.7424
Region[T.West] -4.4186 0.0496 -89.1207 0.0000 -4.5158 -4.3214
Percent_Change 0.1164 0.0031 37.0790 0.0000 0.1102 0.1225
----------------------------------------------------------------------------------------------
Omnibus: 3474.612 Durbin-Watson: 0.080
Prob(Omnibus): 0.000 Jarque-Bera (JB): 22515.428
Skew: -0.004 Prob(JB): 0.000
Kurtosis: 6.604 Condition No.: 145
==============================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Step 3
We want a visual representation of the reliability of the model. To do this we plot the residuals (the errors between the model's predicted and actual values) against the fitted values of the model. We see a very large range of residuals and a funnel shape in the graph, indicating that this model is missing other variables that impact obesity rates.
Looking at all of this, we can conclude that this data does not follow a linear relationship. We will not bother fitting a linear model to the inactivity rates, as they share the same independent variables and the errors that come with them. We are going to move to a different type of model.
plt.figure(figsize=(8,6))
plt.scatter(olinmodel.fittedvalues, olinmodel.resid)
plt.axhline(y=0, color='orange', linestyle='-')
plt.title('Obesity Linear Model Fitted Values over Residuals')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()
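The funnel shape suggests heteroscedastic errors. As a complement to the visual check, a Breusch-Pagan test could be run against the fitted model; a minimal sketch, assuming the olinmodel fit above:
#Breusch-Pagan test for heteroscedasticity (a diagnostic sketch)
from statsmodels.stats.diagnostic import het_breuschpagan
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(olinmodel.resid, olinmodel.model.exog)
print('Breusch-Pagan LM p-value:', lm_pvalue)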
Step 4
We are going to use a random forest to model this data. Random forests are helpful because they can handle relationships that are not linear in nature. A random forest uses a set of decision trees to predict values from the independent variables, which suits the categorical nature of our data across strata and regions. The random forest will also show which variables are most influential in affecting obesity rates.
To do this, we will use the scikit-learn framework. Our first step is to one-hot encode our data, replacing each categorical variable with indicator columns that hold a 1 for the category a row matches and 0 elsewhere; this creates a new column for each stratification value. We then scale our data to a common unit, removing unit confusion. Finally, we split our data into a training and a testing set, to get a fairer measure of how the model generalizes.
To judge whether our random forest is accurate, we take the mean squared error of its results, which measures the difference between its predicted obesity/inactivity rates and the actual data for each row. We also list and graph the feature importances, to see if some features overpower others, and plot the residuals and the predicted versus actual values.
In our first random forest for the obesity set, we see a relatively small mean squared error of about 3.3, which corresponds to an average error of roughly 1.8 percentage points, well under 10% of typical values. We also see two very strong features: Asian and 18-24. We saw this in the linear model as well. These two groups are outliers, with the lowest obesity rates in the country, and may be distorting our model, so we will filter them out for another run of the random forest.
We also see in the second graph that our predicted values follow the actual values reasonably well. Our goal is for the points to lie as close to the orange line as possible, and they are much closer than for the linear model.
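To make the encoding concrete, here is a toy sketch (with made-up data, not part of the analysis) of what pd.get_dummies produces:
#toy illustration of one-hot encoding (hypothetical data, not part of the analysis)
toy = pd.DataFrame({'Region': ['West', 'Southeast', 'West']})
#each category becomes its own indicator column: Region_Southeast, Region_West
print(pd.get_dummies(toy, columns=['Region']))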
# a function to allow for easy repeatability
def rf_analysis(df, data):
#scale our data and create a list of our features, including the new dummy ones
scaler = StandardScaler()
feats = ['Percent_Change', 'Price_Index'] + [c for c in df.columns if c.startswith('Strat') or c.startswith('Region') or c.startswith('Risk')]
x = scaler.fit_transform(df[feats])
y = df['Data_Value'].values
#split our data, 90% training and 10% to test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.1, random_state=30)
#create our forest and calc our MSE
forest = RandomForestRegressor(n_estimators=100, random_state=30)
forest.fit(x_train, y_train)
y_pred = forest.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Square Error of {data} Random Forest Model: {mse}")
#list out the importance for each feature
forest_list = forest.feature_importances_
forest_df = pd.DataFrame({'Feature': feats, 'Importance': forest_list})
print(forest_df)
#graph our importance
plt.figure(figsize=(8,6))
plt.barh(forest_df['Feature'], forest_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance in the ' + data + ' Random Forest')
plt.show()
#graph the actual over predicted values, as well as a line y = x
plt.figure(figsize=(8,6))
plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='orange')
plt.title('Actual Over Predicted Values of the ' + data + ' Random Forest')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.show()
#graph our residuals over predicted values
plt.figure(figsize=(8,6))
forest_resid = y_test - y_pred
plt.scatter(y_pred, forest_resid)
plt.axhline(y=0, color='orange', linestyle='-')
plt.title('Residuals Over Predicted Values of the ' + data + ' Random Forest')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()
odummies = pd.get_dummies(obese, columns=['Risk', 'Strat', 'Region'], drop_first=True)
rf_analysis(odummies, 'Obesity')
Mean Square Error of Obesity Random Forest Model: 3.302869382302935
Feature Importance
0 Percent_Change 0.051929
1 Price_Index 0.083600
2 Risk_Not At Risk 0.009302
3 Strat_$25,000 - $34,999 0.001307
4 Strat_$35,000 - $49,999 0.001561
5 Strat_$50,000 - $74,999 0.001267
6 Strat_$75,000 or greater 0.015355
7 Strat_18 - 24 0.159959
8 Strat_2 or more races 0.003125
9 Strat_25 - 34 0.006206
10 Strat_35 - 44 0.005375
11 Strat_45 - 54 0.014326
12 Strat_55 - 64 0.007546
13 Strat_65 or older 0.014492
14 Strat_American Indian/Alaska Native 0.021585
15 Strat_Asian 0.343933
16 Strat_College graduate 0.052354
17 Strat_Female 0.001844
18 Strat_Hawaiian/Pacific Islander 0.036879
19 Strat_High school graduate 0.002026
20 Strat_Hispanic 0.001652
21 Strat_Less than $15,000 0.007632
22 Strat_Less than high school 0.007384
23 Strat_Male 0.002016
24 Strat_Non-Hispanic Black 0.026632
25 Strat_Non-Hispanic White 0.009292
26 Strat_Other 0.020001
27 Strat_Some college or technical school 0.001469
28 Strat_Total 0.001760
29 Region_National 0.011045
30 Region_Northeast 0.026337
31 Region_Southeast 0.013147
32 Region_Southwest 0.008375
33 Region_West 0.029284
Step 5
We first remove our outlier groups, and then run our random forest model again to get a better picture of the impactful variables.
We find that the price index and the percent change in food prices are now among the most impactful features, and the MSE stays similarly small at about 3.45, indicating the model remains fairly accurate. Our graphs also show that the model's predicted values do not stray too far from the actual values. We can now see the most impactful variables for adult obesity rates: food prices, along with several of the strata groups listed among the risk groups earlier in the document.
obese_outliers_gone = obese[~((obese['Strat'] == 'Asian') | (obese['Strat'] == '18 - 24'))]
oo_dummies = pd.get_dummies(obese_outliers_gone, columns=['Strat', 'Region', 'Risk'], drop_first=True)
rf_analysis(oo_dummies, 'Obesity')
Mean Square Error of Obesity Random Forest Model: 3.4521238743072393
Feature Importance
0 Percent_Change 0.100906
1 Price_Index 0.162334
2 Strat_$25,000 - $34,999 0.002816
3 Strat_$35,000 - $49,999 0.003403
4 Strat_$50,000 - $74,999 0.002643
5 Strat_$75,000 or greater 0.030941
6 Strat_2 or more races 0.006292
7 Strat_25 - 34 0.012959
8 Strat_35 - 44 0.010933
9 Strat_45 - 54 0.030708
10 Strat_55 - 64 0.015340
11 Strat_65 or older 0.030186
12 Strat_American Indian/Alaska Native 0.043297
13 Strat_College graduate 0.108267
14 Strat_Female 0.003861
15 Strat_Hawaiian/Pacific Islander 0.076447
16 Strat_High school graduate 0.004224
17 Strat_Hispanic 0.003336
18 Strat_Less than $15,000 0.014805
19 Strat_Less than high school 0.014992
20 Strat_Male 0.004255
21 Strat_Non-Hispanic Black 0.053239
22 Strat_Non-Hispanic White 0.018548
23 Strat_Other 0.040477
24 Strat_Some college or technical school 0.003012
25 Strat_Total 0.003681
26 Region_National 0.021995
27 Region_Northeast 0.053997
28 Region_Southeast 0.024626
29 Region_Southwest 0.014480
30 Region_West 0.061191
31 Risk_Not At Risk 0.021810
Step 6
We will now repeat our analysis for our inactivity dataframe. We see much different importance values for inactivity rates: the biggest identifiers of inactivity levels are whether an individual falls in our listed risk group, having less than a high school degree, and being high income. We have seen some of these trends before, with higher-income individuals decreasing their activity rates. This model is also quite accurate, with an even smaller MSE of about 2.61, and its graphs are more tightly bound to the ideal line as well.
inactive_outliers_gone = inactive[~((inactive['Strat'] == 'Asian') | (inactive['Strat'] == '18 - 24'))]
io_dummies = pd.get_dummies(inactive_outliers_gone, columns=['Strat', 'Region', 'Risk'], drop_first=True)
rf_analysis(io_dummies, 'Inactivity')
Mean Square Error of Inactivity Random Forest Model: 2.612063666843684
Feature Importance
0 Percent_Change 0.034283
1 Price_Index 0.041326
2 Strat_$25,000 - $34,999 0.008151
3 Strat_$35,000 - $49,999 0.019487
4 Strat_$50,000 - $74,999 0.003909
5 Strat_$75,000 or greater 0.082610
6 Strat_2 or more races 0.001310
7 Strat_25 - 34 0.006513
8 Strat_35 - 44 0.000389
9 Strat_45 - 54 0.006509
10 Strat_55 - 64 0.017036
11 Strat_65 or older 0.044173
12 Strat_American Indian/Alaska Native 0.023937
13 Strat_College graduate 0.091810
14 Strat_Female 0.009938
15 Strat_Hawaiian/Pacific Islander 0.006280
16 Strat_High school graduate 0.032447
17 Strat_Hispanic 0.004697
18 Strat_Less than $15,000 0.051601
19 Strat_Less than high school 0.195045
20 Strat_Male 0.000254
21 Strat_Non-Hispanic Black 0.004278
22 Strat_Non-Hispanic White 0.000520
23 Strat_Other 0.004468
24 Strat_Some college or technical school 0.000377
25 Strat_Total 0.004040
26 Region_National 0.002896
27 Region_Northeast 0.005725
28 Region_Southeast 0.020076
29 Region_Southwest 0.006600
30 Region_West 0.064367
31 Risk_Not At Risk 0.204947
Insight
As a whole, the recent rates of obesity in American adults are worrying. They have increased over the past several years and show no sign of stopping in the near future. Given the health risks that come with obesity, actions to help reduce these rates would be highly beneficial to the American population.
Through our analysis, we have found that certain factors of an individual's status or environment can have large impacts on their chances of being obese or inactive. We found that groups such as Black adults, Hispanic adults, and low-income individuals are especially prone to obesity. The analysis has also shown that food price changes have stronger effects for these groups, meaning higher food prices could lead to higher obesity rates among them. Policy that helps reduce food prices in areas predominantly populated by these risk groups could therefore help reduce their obesity rates. Other geographical divides could also be mended: seeing that the Western region has the lowest obesity rates in the country, other regions could adopt some of the policies the West implements, which could help lower their obesity rates as well.
Looking at inactivity rates, other groups are at risk. Individuals with high incomes and individuals with lower education levels may be at greater risk of inactivity. Considering that physical activity is a crucial element in reducing obesity, public policy could be implemented to encourage more activity: subsidizing gym memberships, adding open spaces like parks in areas with low activity levels, or providing American adults with more free time.
With food price inflation seemingly unending, it is important for policy makers to find ways to limit the increase of food prices. This analysis has shown that increases in the price of food can worsen obesity and inactivity rates, so it is an important consideration when looking at the health of American adults.